
Episode 13 : Enhancing System Resilience and Reliability through SRE

Observability Talk

In this insightful conversation, Ravindra Harish, Head of SRE, Kotak Mahindra Bank, shares his journey from software engineering to site reliability engineering (SRE). He discusses the challenges of shifting from traditional operations to an SRE mindset, why reliability should be treated as a feature, and the evolving role of observability in modern infrastructure. The discussion also explores key SRE principles, including toil reduction, incident management, and the importance of business-centric observability. This episode is a must-listen for anyone looking to build or optimize an SRE practice.

Transcript

Communication Challenges in Tech

00:00:00
Speaker
The constant ping from various different people who are not very much into technology in your corporate world. It could be product, it could be business owners, it could be the leaders, etc. They're knocking on your door. It could be the L2, L3, or SRE engineers.
00:00:16
Speaker
It points to a more structured type of solution for all those people, where you're actually reducing the toil by building the right, appropriate dashboards as well. So there is a component of dashboards which also solves the toil part, which we can expand on.
00:00:43
Speaker
Hi,

Introduction to Ravindra

00:00:44
Speaker
welcome to a new episode of Observability Talk. Today, I am excited to welcome Ravindra, our special guest on the Observability Talk.
00:00:54
Speaker
Ravindra is an experienced technology leader with over 20 years of experience leading key operations of cloud engineering and services, automation, observability, toil reduction, and site reliability engineering.
00:01:08
Speaker
He currently heads the SRE practice at Kotak for their mobile platforms. Before that, he was at Nike and Infosys. I met him last year when he created an SRE event at Nike and brought together industry practitioners as well as Nike software engineers to talk about toil reduction, chaos engineering, SRE, and observability.
00:01:33
Speaker
Although I was not on the speaker list, he agreed to let me talk about business journey observability and how we at VueNet stitch the customer journeys together. Hey Ravindra, thank you so much for joining us today.
00:01:48
Speaker
Bharat, thanks for having me. This is a great opportunity, and I'm extremely excited to spend some time with you. Great.

Ravindra's Career Journey

00:01:56
Speaker
Ravindra, to start with, you have been leading a lot of technology teams, right? And then you moved on to become a program manager.
00:02:06
Speaker
And then you put your focus on site reliability engineering. And now you are heading a function, the SRE function, at a large bank. Can you talk about your journey and how you ended up leading SRE?
00:02:20
Speaker
And what really excites you about this whole SRE practice? So, I mean, that's a fantastic question. I think this is where the phrase "finding your own niche helps you excel in the area you want" is a little bit applicable. When I explain a little bit of the story, that might make sense. Like anybody, I started my career as a software engineer.
00:02:46
Speaker
At that point in time, SQL and data warehousing were in extreme demand. What happened was, when I was brought into a project, I literally started struggling when it came to programming from scratch.
00:03:02
Speaker
Anything to be built from the ground up was not managed at all. But incidentally, it so happened that during a couple of weeks when I was doing on-call, the analytical skills sort of kicked in, and I figured out that finding the faults in the program, or finding the faults in the

Transition to SRE

00:03:18
Speaker
execution and the performance, was something which sort of interested me. And I must say, from where I stand today in the SRE world, there is no looking back.
00:03:26
Speaker
So from being a software engineer, life really moved on, ultimately landing me well into the operations side of the world. I've done the L1 sort of thing, but I quickly realized that I had to move to L2 and L3, whatever the terminology is, however multiple companies define it.
00:03:41
Speaker
So my ladder moved on to that. I worked on the East Coast of the US for a logistics type of company, which showcased how difficult the world of operations could be, both from a business standpoint and a technical standpoint.
00:03:54
Speaker
And eventually, the journey ended up at Nike, as you know, on the West Coast of the US. I literally spent close to 14 years with Nike, including in India when I came back in 2022.
00:04:06
Speaker
But the interesting fact is, when we were really deep in the operations layer, 2016 is when Nike decided that, look, let's have a dedicated SRE. And that, maybe somewhere in our conversation we can also talk about it, is what becomes the perfect mix of what should come from the operations side and what should come from core developers into the mix of SRE.
00:04:27
Speaker
So fortunately, I was selected to lead and build a team for that SRE at that point in time. And that's how the SRE journey really started off, because Nike decided and dedicated the manpower, effort, and whatnot to follow the Google practice of putting in the mindset, what we call doing operations with a software engineering mindset.
00:04:47
Speaker
So that's where everything really started off. And here we are in 2025, in the world of Kotak, trying to apply the same principles once more in my life.

Mindset Shift to SRE

00:04:58
Speaker
And how easy was it for you to actually transition into that mindset of SRE, where you need to apply some of those software engineering principles to reliability?
00:05:09
Speaker
Oh, extremely difficult. I'll tell you this. I mean, it's not easy at all, because your brain is wired to deal with operational problems completely in an operational way, because you want to escape from the pain, because you're getting constant calls asking, did we solve the problem? No, not just the incident,
00:05:27
Speaker
the mindset has to shift to solving the underlying issue, right? So that's... I'm going to say it's difficult not just for me. It's the surrounding environment as well, which has already seen you as an operational type of guy, or a team whose members are known to receive the on-call, restart something, go and update the database, or do whatever it takes to get back to the optimal operating state.
00:05:53
Speaker
Putting that software engineering mindset to solve a problem, like, okay, I saw these issues three times. Now let me convert it into a problem. Let me convert this into a Jira.
00:06:04
Speaker
Let me bring software sprint planning to it so that I never see this repeating a third time. That really requires conscious awareness, let's put it that way. It's not about your skills.
00:06:16
Speaker
It's more of a psychological aspect: let me take a breath, let me solve this issue, but let me not run away from it, let me solve this permanently. And that's why the concept of DevOps, where this pressure is put on engineers directly, really helps out, because engineers being the smart people, they never want to see the same issue again. They're tired; they don't want to wake up in the night, et cetera, right?
00:06:37
Speaker
So my answer and conclusion to you is: it takes effort. It's not very straightforward to adopt the concept of SRE and do it with a software engineering mindset when you have operated the other way for a very long period, specifically in bigger companies and whatnot, where you have a dedicated L1 who is constantly restarting services for you, or there is a team which is constantly monitoring alerts for you.
00:06:57
Speaker
Yes, I'll stop there. Yeah, and that is very interesting. I think in the third or fourth episode, we talked to Safir. I think you met him in one of the Hasgeek sessions, right? Yes, yes. And he made a very interesting point, and he was really emphatic about saying this: that people have to start looking at reliability as a feature,
00:07:20
Speaker
rather than an afterthought, or trying to figure out, okay, I have my application built, now how do I go and put reliability around it? Do you have any thoughts on that?

Integrating Reliability Early

00:07:34
Speaker
What do you think about such a statement, that reliability should be a feature of your application?
00:07:40
Speaker
I have two answers for that. One is, I want to supplement his statement with one more statement of mine, which you have personally heard me say multiple times. Whenever I introduce myself and say my title, my next immediate line is: I'm from SRE, but I'm not from production support.
00:07:57
Speaker
What I mean by that: I have personally experienced that a typical mindset in various different organizations is an old-wine-in-a-new-bottle type of concept, where production support has somehow been relabeled as SRE. That's not true; that's point number one. Point number two, aligning to the statement you quoted, is that reliability is another practice that runs exactly like your software development lifecycle.
00:08:25
Speaker
So when you start your software development lifecycle with the product team, which is ultimately collecting the requirements from your business partners and whatnot, the process and the journey of SRE originates right there.
00:08:37
Speaker
SRE is not at the end when you've decided to go live, and that's one of the most common mistakes we all make, because we're in a hurry to collect the requirements and finish the functional requirements. Reliability has never run in parallel, but I'm very happy to say the mindset and the culture have been changing.
00:08:53
Speaker
I mean, specifically where I am standing now with Kotak, the way we have been hired and are being merged into the stream of product development, SRE is running parallel to them.
00:09:03
Speaker
So my two important points are: SRE is not production support. Second, reliability starts in parallel, horizontally, at the same starting line as your product, and not at the end when you're deciding to go operational.
00:09:19
Speaker
I hope I answered your question. Yes, yes, very much. So I think we are all in sync that reliability is a very critical component people should start looking at very early on, maybe day zero or sometimes even day minus one.

Golden Signals in Applications

00:09:34
Speaker
You talked about... that when a problem happens, people definitely want that problem to be solved very quickly. And given the kind of complex infrastructure we deal with, right, as an SRE, there are cases where things might be going down and people want them back up very, very quickly.
00:09:54
Speaker
And you would have seen various types of problems in scenarios like that, right? As a site reliability engineering expert, what keeps you awake at night? What is it that you are most worried about?
00:10:08
Speaker
Yeah, makes sense. I wanted to put up a punchline out there, which is: ideally speaking, as an SRE team, if I am operating as a true SRE in an organization, nothing should keep me awake.
00:10:23
Speaker
What I mean by that is, my job is ideally to prevent any of these types of issues from happening, or my job is to make sure, as a post-mortem of the event, that this should never happen again.
00:10:38
Speaker
That type of mindset, right? So personally, and I think many people can challenge me on this, my opinion is that if there is a night when a critical issue is happening, I should not be the one there.
00:10:49
Speaker
Because either there is an L1, L2 type of production support, which is there to extinguish the fire at that point in time, or the rotational model of DevOps engineers, who actually need to be there so that they know and understand what needs to be done immediately to extinguish it.
00:11:03
Speaker
So, anyway, I know this is a debatable topic, but my intention is, assuming critical issues are happening in the night, I shouldn't be the one awake in the night. But I get your question. What keeps me worried is something I do want to answer. That being said,
00:11:18
Speaker
at the end of the day, Bharat, as I say, who's going to guard the guard? SRE is not a miracle sitting somewhere in the sky. The SRE tools and technical capabilities that we eventually build are also sitting on a similar type of platform to the one the functional requirements run on.
00:11:38
Speaker
It could be a cloud platform, it could be a tool; it's a machine at the end of the day. My problem is, what if our observability tool itself goes for a toss at the time when it is really required? And that happens more often than we can imagine. It happens.
00:11:52
Speaker
And many times we go to a post-mortem type of session, or what we call the AAR, after-action review, or I think typically it's called root cause analysis and stuff like that. There are times when we have to accept the fact that, yeah, my alert didn't really kick off at the time when it was supposed to kick off.
00:12:06
Speaker
So that's something which really bothers me all the time, if I'm actually putting a stack of observability onto a bunch of functional verticals, because I sit as a horizontal across all of them, connecting the dots, giving them the dashboards and whatnot.
00:12:20
Speaker
But how will I make sure that my own stack holds up? Am I doing the rightful reviews and checks and balances, scalability, all these fancy terms we talk about, then the costing and whatnot? Are we doing that for our own tool at any point in time? That is one of the questions.
00:12:37
Speaker
So that is one of the challenges I am very much facing, even at Kotak, no doubt about it. Then the second one, which I feel is an important challenge, is... in the monolithic world, everything was very tightly coupled.
00:12:50
Speaker
So it was easy to understand what was happening. In microservices, I'm sure you understand, it's an all-hell-broken-loose type of world out there. Everybody, as true engineers, needs to have independence over their technology, their tech stack, their cloud platform.
00:13:07
Speaker
But we are sitting in the complex world of microservices. For a complete application to work, they all have to come together very nicely, the world of APIs, you name it. For me, on the observability side, I'm the guy who is supposed to stitch that together entirely and give a final report to the world, so that the business, even the C-suite type of leaders, or the layman type of engineer, including the L1 guy, all have a holistic view which can tell a common story.
00:13:32
Speaker
But that's the problem. The cultures of the microservice teams are different. The tech stacks are different. The costing and whatnot are different. And then comes the legacy part of it, where the modern-day solutions are not so flexible to connect to the world of cloud to tell me a story of observability. So that's one of the challenges which really bothers me all the time: how do we connect it end to end? If I have to put it in one phrase, end-to-end observability is something which really, I'll say, irritates me all the time, because finding that niche success out there, Bharat, I'll tell you, is very close to impossible in my opinion.
00:14:10
Speaker
Although I'm sure many people will challenge me on that as well. And the last point, Bharat, on what keeps me awake is, again, alignment with respect to what we see as the rightful metric.
00:14:21
Speaker
You know, when we talk about observability, for us the wins are: I bring down the MTTD, I bring down the MTTR, I establish the complete CUJs, what we call the consumer user journeys or the critical user journeys.
00:14:34
Speaker
Now, bringing harmony between the teams, and I want to explain this with an example. You have a front-end team which owns, say, a login functionality, and then you have an extreme back-end or upstream system, or whatever you want to call it.
00:14:47
Speaker
If they are running in batches, for them their service is not S1 all the time. Whereas anything on the front end is always an S1 if it goes down, right? But end to end, ideally, if you look at the CUJ, it translates into an S1 type of case.
00:15:02
Speaker
So if I have to define an SLI and SLO for one entire consumer journey, bringing the relationship and harmony between all the teams is the real work. Basically, SREs ultimately become pretty niche negotiators as well in the world of tech, in my opinion. So we don't just play the technical role; we play a negotiating type of role to bring the teams and organizations together.
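To make the journey-level SLI/SLO idea concrete, here is a minimal sketch in Python. It is an editorial illustration, not something from the episode; the journey name, target, and counts are hypothetical.

from dataclasses import dataclass

@dataclass
class JourneySLO:
    name: str
    slo_target: float      # e.g. 0.999 means 99.9% of journeys must succeed
    total_journeys: int    # journeys attempted in the measurement window
    failed_journeys: int   # journeys that failed anywhere along the path

    @property
    def availability(self) -> float:
        return 1 - self.failed_journeys / self.total_journeys

    @property
    def error_budget_remaining(self) -> float:
        # Fraction of the allowed failures still unspent in this window.
        allowed_failures = (1 - self.slo_target) * self.total_journeys
        return 1 - self.failed_journeys / allowed_failures

# Hypothetical numbers for a login CUJ over one month.
login_cuj = JourneySLO("login", slo_target=0.999,
                       total_journeys=1_200_000, failed_journeys=600)
print(f"availability={login_cuj.availability:.5f}, "
      f"error budget left={login_cuj.error_budget_remaining:.0%}")

The point of measuring at the journey level is exactly the negotiation described above: every team whose service sits on the path shares one target and one error budget.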
00:15:25
Speaker
I'll stop, sir. These are

Promoting SRE Culture

00:15:26
Speaker
the challenges of modern-day observability. Right. No, I think I completely agree on all the points.
00:15:37
Speaker
I just want to elaborate a little bit on the second point, which talked about the end-to-end part. What I have realized over time, see, we at VueNet generally claim that we do end-to-end unified observability or business journey observability.
00:15:54
Speaker
And as you rightly pointed out, end-to-end needs to be defined correctly: what exactly end-to-end means. Because while we say that we do end-to-end, and we do provide some sort of end-to-end view, it is not at the granular level of a transaction, for example.
00:16:13
Speaker
Purely because there are a lot of shared components within the journey. Right. It could be a network load balancer, it could be a firewall, it could be a switch, it could be a server which is basically being used by X number of applications.
00:16:30
Speaker
So while you will get the visibility, you can still get it end to end, but you will not be as granular as maybe you are in an application server. But it still helps you get a complete picture of your infrastructure if somebody is able to achieve that end-to-end, whatever is in their environment.
00:16:56
Speaker
Couldn't agree more on that, Bharat. You're absolutely right. I think that makes me amend my own line on point two. When I talk about impossibility to this day, I don't mean to say it's only a technical challenge, Bharat.
00:17:11
Speaker
I'll give you another example. Say you have 10 microservices that a journey travels through. You need to understand that companies are organized in various different ways. Say the front-end layer ultimately is under different leadership.
00:17:23
Speaker
The back end, which is the core product, is under different leadership. Now, whether you look into your product or any other product out there, Splunk, whatever, what really happens is, at the end of the day, it's licensing. Don't get me wrong. There is that...
00:17:39
Speaker
a certain part of it would sign up and a certain part of it would not sign up. So that's where the "end" comes in, because you don't have visibility later; you're not stitched together at that point.
00:17:50
Speaker
So my point is not purely that we have a technical incapability at this point, although I do acknowledge that the technical incapability also exists. Because if your systems are in a mainframe type of world underneath, at the back end, under the whole mammoths like Finacle and stuff like that, right? I mean, penetrating that and integrating it with the fancy cloud type of solutions today, that is one sort of technical challenge.
00:18:18
Speaker
But more so, it is the business processes and the team structuring and all of these things, and also how much anybody is invested to really make it happen, or pay for it. So I think it's more of a human problem than a technical problem in today's world.
00:18:33
Speaker
No, no, I completely

Challenges in Observability

00:18:34
Speaker
agree. And we have seen this challenge at multiple customers of ours, right? Where either the application is not emitting enough metrics or telemetry, or it is, let's say, a C++ application, right? Which cannot be instrumented, right?
00:18:55
Speaker
And needs to be monitored only through logs. Right. Yes. Yes. So those sorts of challenges are definitely there. I completely agree on those parts. So, Ravindra, you have had a great amount of learning from managing these digital channels, right, of customer-facing applications and services.
00:19:13
Speaker
Like you were basically looking at SRE for their internet-facing applications, right, the websites and so on. If somebody is to pick up SRE to ensure the best customer experience, what should their top considerations be, and what should they look for, if they are now setting up an SRE practice, or want to bring in an SRE practice, to make sure that their applications, services, and infrastructure are providing the best customer experience?
00:19:46
Speaker
Yeah, great one, Bharat. This is very crucial for any application. Although the question is very specific to consumer-facing, and I will answer specifically to that part, in general, before I start with my answer, I want to make this comment: at the end of the day, on the observability side, at my position, when we try to see things with a horizontal type of view,
00:20:11
Speaker
our mindset is always about not differentiating anything at all within the stack of technology in a company, because there is nothing called "not consumer-facing" at the end of the day. Any small service written at the extreme back end will have a contribution to the consumer-facing side. That's my general notion, but I understand your question, so let's go to that particular point. So, consumer-facing...
00:20:36
Speaker
I mean, I'm sure we are very familiar with this term called golden signals. That would be the starting point, what we should aim for with consumer services. Because, you know, in this fast-paced world, the attention span we talk about, there was a time when it was minutes; now it is seconds.
00:20:59
Speaker
If a website like amazon.com doesn't really serve my need at that moment, I lose interest in the overall product. Like, say I'm searching for some little item for my cat, and by the time Amazon responds to me with quick answers, if it's not there, I move on to searching for something else for my daughter.
00:21:23
Speaker
That's the attention span we have. So that's why golden signals become extremely important. What do we mean by golden signals, for the benefit of our audience? Some of the core metrics which tell you, as a standard, how well my application is performing are the golden signals.
00:21:40
Speaker
Namely, availability, latency, throughput, errors. There are quite a few, but the top three are availability, latency, and throughput. That is a must.
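To ground the golden signals just listed, here is a minimal sketch in Python, an editorial illustration rather than anything from the episode; the request records and the 60-second window are hypothetical.

# Each record: (timestamp_s, latency_ms, http_status)
records = [(0.2, 120, 200), (1.1, 95, 200), (2.4, 480, 500),
           (3.0, 150, 200), (4.7, 210, 200), (5.5, 90, 404)]

window_s = 60.0
errors = sum(1 for _, _, status in records if status >= 500)
error_rate = errors / len(records)                 # errors
availability = 1 - error_rate                      # availability
latencies = sorted(lat for _, lat, _ in records)
p95_latency = latencies[int(0.95 * (len(latencies) - 1))]  # crude latency percentile
throughput = len(records) / window_s               # throughput in requests/second

print(f"availability={availability:.2%}, p95~{p95_latency}ms, "
      f"throughput={throughput:.2f} req/s")

In practice these would come from an observability platform rather than an in-memory list, but the four quantities are the same ones the speaker names.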
00:21:51
Speaker
Now, in the consumer space, these golden signals by themselves can be sliced into multiple options. Am I worried about the real-time serving of the front-end type of application, versus the cached information, my transaction information which I can cache, and whatnot?
00:22:08
Speaker
So as long as I can stitch a story in a dashboard format which shows that my website and my features were 99.99 percent available, or whatever the target is, I am making my consumer happy. So that's my inclination on your question.
00:22:22
Speaker
So golden signals are the most important thing to focus on. Next is synthetic monitoring. We have talked about this in the past as well. There might be a time, say one o'clock in the night, when we as engineers relax a little bit about performance, assuming that there's not much traffic.
00:22:40
Speaker
But guess what? One of your best customers might be awake that night, or he might be in a different time zone, and he might hit your website at 1 o'clock in the night too. And a bad experience still matters at that point in time. So synthetic monitoring is where I artificially induce these types of transactions so that I get the right alerting to see if the site is up or down. This is very important in consumer-facing observability: the application needs to be up around the clock, excluding the maintenance period, but that maintenance period has to be managed in a different way, which is a different topic. As long as you are considering the window to be 24/7, synthetic monitoring becomes the most important thing, especially for the features you really care about, for example checkout. Checkout is where the money is really made and the consumer is turned into revenue. So that's synthetic monitoring.
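A minimal sketch of the synthetic check described above, in Python: periodically probe a critical endpoint even at 1 a.m. and flag failures or slow responses. The URL, thresholds, and alert hook are hypothetical placeholders, not from the episode.

import time
import requests

CHECK_URL = "https://shop.example.com/checkout/health"  # hypothetical probe target
LATENCY_BUDGET_S = 2.0

def send_alert(message: str) -> None:
    # Placeholder: in practice this would page or open an incident.
    print(f"ALERT: {message}")

def synthetic_check() -> None:
    try:
        started = time.monotonic()
        resp = requests.get(CHECK_URL, timeout=5)
        elapsed = time.monotonic() - started
        if resp.status_code != 200:
            send_alert(f"checkout probe returned HTTP {resp.status_code}")
        elif elapsed > LATENCY_BUDGET_S:
            send_alert(f"checkout probe slow: {elapsed:.1f}s")
    except requests.RequestException as exc:
        send_alert(f"checkout probe failed: {exc}")

if __name__ == "__main__":
    while True:          # run around the clock, not just business hours
        synthetic_check()
        time.sleep(60)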
00:23:33
Speaker
Next is distributed tracing. Whether to do it, I mean, is debatable. But at least to the level where you know that when the consumer comes in and places an order, and we need to give that information back, if I can achieve distributed tracing at least to that level, it's a great win.
00:23:45
Speaker
And finally, dashboards. On dashboards, I want to take a couple of extra seconds here. And that is, Bharat, I hope you'll agree with me. There was a time when dashboards were used more for "let's have a look," let's have eyes on the glass to see if there are any spikes.
00:24:03
Speaker
Let's see if there is too much alerting. But Bharat, I hope again you agree with this: today, our monitoring and observability type of dashboards have actually transformed into business dashboards as well.
00:24:14
Speaker
Exactly. I mean, I have seen C-suite type of leaders ultimately looking into various different tools like Splunk and New Relic and all, which tell the story of how well we are doing business.
00:24:26
Speaker
How is the traffic doing? So I couldn't emphasize dashboards more, especially when it comes to consumer-facing. As long as you have the right tool for your SRE practices, you are actually using the same tool not just for monitoring; you're using it for your business metrics as well. You don't need to go buy one more extra tool out there in the market to tell you the analytics of your business usage and stuff.
00:24:47
Speaker
As long as you put your MLT, which is metrics, logs, and traces, in an appropriate fashion, probably the CEO gets his report from the observability type of tool nowadays.
00:24:58
Speaker
So that's why I emphasize, in the consumer realm: dashboards, dashboards, dashboards, to an extent. People always challenge me, like, why am I so focused on that? I need alerts. People focus on alerts. The alert is more from an engineering standpoint.
00:25:10
Speaker
They need to go and fix it before it becomes an alert. So no downplaying at all; alerts are very important. But the dashboard has its own different game ultimately. So that's... And I'll finish off with a couple more points if it's okay with you. Can I take... Please, please, please.
00:25:24
Speaker
Then there is scalability and adaptability. Scalability, as you can rightly understand, for consumer-facing: traffic can suddenly increase or suddenly drop, and I need to arrange my scalability not just for the functional requirement but even from an observability standpoint.
00:25:39
Speaker
Scalability means my tool should be able to collect the metrics and collect the logs even when there's suddenly a spike, right? And this is not hypothetical. This is something we have really observed.
00:25:52
Speaker
At the time of peak, when our functional solutions were working, our observability crashed. Right. Because it failed to contain the flood of information that was coming at it.
00:26:03
Speaker
And ultimately, when we go to slice and dice how we performed, or what really happened, why did this fail? Guess what? We don't have the metrics. This is where I would talk about what keeps me awake at night: the guard who guards the guard is actually failing the system.
00:26:17
Speaker
So that's one part. Scalability is very, very crucial and important. And finally, I'll end this answer with the sort of challenges I can think of, which again, since you said consumer-facing, vary from company to company.
00:26:32
Speaker
If you're a completely cloud type of platform, your observability implementation would have been easy, so you're good. But if you're a very big old mammoth of a company which has traditional legacy systems out there, which can't really be plugged into your modern-day type of observability tool,
00:26:50
Speaker
your end-to-end type of story which we talked about is still a challenge. So it depends on how well you, I mean, this is the full story, right? There is a portion of legacy systems on the back end, mainframe, etc.
00:27:04
Speaker
But you still have a bigger chunk of cloud already implemented. So see what you can do within that hemisphere to stitch a story for you, where consumers are not really hurt, where the sentiments are not hurt.
00:27:15
Speaker
So those are my suggestions and advice ultimately on what we need to focus on when it comes to consumer-facing. I'll stop there. No, I think if anybody is trying to set up SRE, I would suggest that person just listen to these five minutes of your insights, right? Because this is very to the point, right? I mean, you are very clearly saying what you need to monitor. Second is what are the things you need to do, like synthetic monitoring, distributed tracing?
00:27:46
Speaker
One very important thing, which I think not many people talk about, is the business dashboard part. In fact, at VueNet, 90% of our dashboards are business dashboards, because we are trying to bring the business context into this machine data, which you basically get from those agents or agentless methods, right? Finally, the telemetry is all machine data.
00:28:09
Speaker
It doesn't give you any context of what business, what application, what criticality, what customer impact somebody is having, right, without that business context.
00:28:21
Speaker
So I can't agree more on this: while people do not want to look at dashboards per se, this business dashboard has been solving a very critical problem which had not been solved earlier.
00:28:36
Speaker
Just to amplify the importance of the dashboard, I'm happy that we both agree on it, because I'm a huge, huge advocate of dashboards, equally respectful of alerts as well.
00:28:49
Speaker
But the world is changing from a dashboard standpoint, which we talked about. And I'll also tell you this. We have talked in the past about this concept of toil reduction, one of the pillars of SRE. Without the dashboard, I'll tell you what happens:
00:29:03
Speaker
the constant ping from various different people who are not very much into technology in your corporate world. It could be product, it could be business owners, it could be the leaders, etc.
00:29:15
Speaker
They're knocking on your door, whether it's the L1, L2, L3, or SRE engineers. A dashboard gives a more structured type of solution to all those people, where you're actually reducing the toil by building the right, appropriate dashboards as well. So there is a component of dashboards which also solves the toil concept, which we can talk about later on.
00:29:35
Speaker
So, completely agree. In fact, my next question was going to be on that toil reduction only. I still remember you explaining it when I attended that session at Nike, where you talked about how some of the work you had been doing at Nike was more for toil reduction than anything else, right?
00:29:54
Speaker
So without taking too much time, can you just quickly define what exactly toil reduction is? And then, if somebody is setting up an SRE practice, why should they worry about it, or what should they care about?
00:30:07
Speaker
Yes, wonderful. I mean, this could be its own episode, for quite a few reasons. The breadth of toil reduction is very vast. Second, it happens to be my favorite subject. And as you rightly said, at Nike my attention was more on toil than any other pillar of SRE.
00:30:25
Speaker
But to get on with it, I'll keep it as brief as possible. As we said, toil is one of the foundational pillars of SRE. I mean, this has been referred to in the Google SRE book also.
00:30:38
Speaker
As we have observability, as we have monitoring, as we have chaos engineering, toil reduction sits out there, because it goes to the fundamentals of thinking like a software engineer and not like an administrator. An administrative L1, L2 role is about restarting, updating the system, updating a config file, updating the database.
00:30:55
Speaker
That's operational in nature. Engineers, at the same time, say: let me go solve it from a coding standpoint, where I'll make it a functional requirement. But guess what? Between these two entities, there sits something called the chronic problem.
00:31:07
Speaker
Like how we talk about chronic issues in health, there exist chronic issues in these technical systems as well, which unfortunately nobody wants to take care of. Engineering would be like, huh, we have an L1 there, restarting it.
00:31:20
Speaker
The L1 will be like, why should I keep restarting it all the time? So that's where SRE really comes in with that software engineering mindset. This is probably where we make claims, and probably get the red eyes of engineers too, that we try to optimize the solution better than the engineering teams. And that is where the word toil comes in. Let's take an example of this.
00:31:38
Speaker
And it's nobody's fault. Toil exists in the organization for various different reasons. One, let's take an example of compliance. Now, you have something to be performed on your system.
00:31:51
Speaker
Let's say port opening. For a compliance reason, you can't give access to the engineering teams or the business users or the product team, even though it's a simple sort of execution which could be performed very easily by anybody.
00:32:04
Speaker
But guess what? That transforms into a chronic task either for an L2 or an L3 team, because we are not supposed to give access to them. So this is one example of toil. I mean, I can go on with that.
00:32:15
Speaker
But then the point is, productivity is bad. You are hiring more and more people, especially in this world of AI and whatnot. And that's where we'll probably talk about AIOps at a later stage in the world of SRE. But not everything needs artificial intelligence.
00:32:31
Speaker
It's a common-sense type of problem which exists, which even today is solved by manpower. And that is where our passion was. Like when I first got the hang of Slack in the world of Nike, and we realized that you can create apps and bots on it.
00:32:46
Speaker
It blew our minds. And I'll explain what I mean by a bot with an example. Bharat, you know, coming from Nike, we had responsibility for around 1000 stores, retail stores.
00:32:59
Speaker
And every alternate month, we had to release the application, it being a legacy type of application. We had physical servers within the four walls of the stores. And we had to hit 1000 of those stores once every two months.
00:33:12
Speaker
Now, you are firing command after command out there. You have to restart the servers, you have validation of the servers, you have installation of the things. And when this is all happening, then you have to do validation, which means you log into the server. Now, obviously, the automation part of it would have kicked in a long time back, which is: write a script, fire the script.
00:33:31
Speaker
But then that demands you be a specific L3 person with a specific set of access, and it has to be executed on one particular server, because that is where you have access.
00:33:43
Speaker
Guess what? If all of those things can be leveraged with the right access, with the right people, with the right workflow, with a single message on Slack, that's toil reduction, in my words.
00:33:55
Speaker
So that's what we achieved. A task across a thousand stores which used to be done by one person, who was not able to take time off during the day of release. Now he could actually get up in the morning, or anybody could get up in the morning before the store opens, even while brushing their teeth, sorry for my words, and they just have to send a message on Slack and it gets executed. But the other important fact I talk about with toil reduction, especially with collaboration tools like Slack, Teams, and stuff like that, is that you're not doing it in a silo. When you do something on a website, it's a one-on-one relationship.
00:34:28
Speaker
That's what I always say. Whereas when you do it on a Slack channel, you're doing it as a collaborative team, where everybody is seeing what the other person is executing. So that's the beauty of it.
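A minimal sketch of the Slack-driven release task described above, using Slack's Bolt for Python. It is an editorial illustration under stated assumptions, not the actual Nike tooling; the command name, validation logic, and tokens are hypothetical placeholders.

import os
from slack_bolt import App

app = App(token=os.environ["SLACK_BOT_TOKEN"],
          signing_secret=os.environ["SLACK_SIGNING_SECRET"])

def validate_store(store_id: str) -> str:
    # Placeholder for the real workflow: restart services, run checks, etc.
    return f"store {store_id}: services restarted, validation passed"

@app.command("/validate-store")
def handle_validate(ack, respond, command):
    ack()  # acknowledge within Slack's 3-second window
    store_id = command["text"].strip() or "unknown"
    # Responding in the channel keeps the whole team in the loop,
    # which is the collaborative point made above.
    respond(validate_store(store_id))

if __name__ == "__main__":
    app.start(port=3000)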
00:34:38
Speaker
And we have been plugging into that Slack model here also now. A question comes in from one person saying, hey, tell me, what is the rate limit on AWS? I want to know.
00:34:50
Speaker
But now Bharat, Nandita, everybody gets to see that answer in a collaborative way. So that's the power of Slack. That's a different topic. But that's a classic example of toil reduction, which I'm very passionate about.
00:35:02
Speaker
So that's the whole story about toil reduction. I'll take a pause. Yeah, I believe this has basically helped enterprises like Nike not only save cost, but also enhance their customer experience and then revenue and so on. It's an indirect impact, right? I mean, whichever way you look at it, you reduce cost as well as maybe increase revenue.
00:35:26
Speaker
Absolutely. And not just that. Talking about the collaborative tools, although I'm diverting from toil reduction, collaborative tools, which are also very much SRE in nature, are that nice place, Bharat, where, I mean, there's always a conversation about AI replacing humans, or humans being redundant when there is AI.
00:35:45
Speaker
But a fine line, and the beauty, is Slack, where you can add a virtual agent as a team member in your channel along with the humans. I will tell you what happens. Assume that your channel is customer support and you have integrated it so that a lot of inputs come in.
00:36:04
Speaker
If you have Slack, you have customers who have a channel focused on one of these things, and when they come and report, hey guys, metrics are not flowing, you know, something like that, the bot can do its job because you have installed a bot out there. But the frustration for the customer, that only a bot is looking into it, is gone.
00:36:22
Speaker
Because humans are there in the channel. When they realize they need to contribute more, they step in. So this is a perfect harmony in a collaborative tool, where bots and humans work together to solve customer problems.
00:36:33
Speaker
Right. That is very, very interesting. Ravindra, coming to some sort of guidance for people: if you are asked to program-manage a critical, scalable, reliable user application, what would be the top three things you would definitely do?
00:36:53
Speaker
And what are those things which you would definitely avoid? Okay. Okay. Top three, top three. Wonderful. So thinking on those lines, let's do this.
00:37:08
Speaker
I'm taking time to think on that. Here's what I'll do. I'll give high-level pointers on the three you are asking about. I probably won't double-click too much, because I am now realizing that I'm making every single answer of mine too long.
00:37:22
Speaker
So let me answer to the point. I think first and foremost is clear objectives and metrics. Every company is different.
00:37:34
Speaker
Every CEO has given a guideline on how he thinks he wants to scale up his company. So we need to align to that, rather than SRE saying, no, SRE has to be done just like this.
00:37:46
Speaker
No. We need to align. So my first point, when you're talking about a critical, scalable, and reliable application, and when you're talking about reliability as part of it: understand what the objectives are and what metrics would support those objectives, because objectives come from the top down.
00:38:05
Speaker
What I can do is think of the right metrics which can support that view from the bottom up, so that my CEO's vision is achieved, right? So that's my first. Basically, if they have decided that they want the site running 24 by 7, there would also be some companies who say the site is required only 9 to 5.
00:38:23
Speaker
So you will decide what the key targets are, like uptime. That is where understanding the mindset of the company is extremely important. Now let's talk about thresholds. You might be like,
00:38:36
Speaker
if I suddenly get demand on my company website of millions per day, maybe I don't even want it, because my manufacturing unit is not ready for it, or something of that nature, right? So that helps you determine what my threshold is.
00:38:51
Speaker
And then there is latency. Latency is like, take Blinkit: have I committed to 10-minute delivery? Have I decided to do two-day delivery? Latency is in those terms. That is why I keep saying, align to your company's vision when it comes to objectives and metrics.
00:39:06
Speaker
That's my first point. Second is architecture and design. Like I told you, reliability runs in parallel from the same starting point as your software development lifecycle.
00:39:17
Speaker
So when that team decides what the architecture will be, what tech stack, what type of cloud, what type of active-active or active-passive setup and all, everything matters. And this is why it's so important, because we run in parallel with that.
00:39:30
Speaker
So my point would be, invest time in a robust architecture and design. What I mean by that, again, two or three pointers: say, load balancing, CDN, caching, circuit breakers, probably failovers.
00:39:44
Speaker
All of these types of parameters become very important to decide. So that's why you design and architect that. And finally, the last one I would, again, put into a bucket as a point is prioritizing testing.
00:39:57
Speaker
And on testing, obviously, the functional team will decide their functional and performance testing. I talk about ours, you know, chaos engineering. We need to break the system intelligently so that I know if my observability is working or not.
00:40:10
Speaker
So that's what I talk about with testing. And most important is incident management. See, you might have the best of tools. VueNet might have implemented the best of the metrics.
00:40:23
Speaker
But if I am not transforming those alerts and that information into the right incident management system, it's just a junk pile of data ultimately sitting there. And, you know, I hope you agree.
00:40:33
Speaker
In today's world, products like yours and many other applications have gotten so much better that there is information overload for either the SREs or the L1s and L2s. Identifying what information should transform into an incident, so that I can go solve it, is the key.
00:40:49
Speaker
Because I can take the example of Kotak; I have no shame in saying that. Since we have your products and many other products, they have the capability to notify in info, warning, and critical type categories, right?
00:41:02
Speaker
If I don't bake that absorption in very appropriately, now I'm taking in info as an incident, warning as an incident, critical as an incident, and then it gets to a point where, when a real critical alert comes in, right?
00:41:14
Speaker
Nobody's looking into it. That's the problem. It's all information overload. That happened in one of our customer environments very recently. We had just implemented there and then we had given training and so on.
00:41:28
Speaker
And the alerts kept coming, but nobody took notice of them. And then the whole system went down. Then we did an analysis of why the system did not generate an alert.
00:41:40
Speaker
Then we told them, boss, the system has been generating alerts. It is just that nobody is taking notice of them. Yeah, bingo. That's where I can relate to yours; I also stand in a similar situation.
00:41:51
Speaker
We have so many alerts firing. What happens is, we start ignoring them, and it ultimately turns into a major fire. So incident management, that's why it's so critical. Assume that VueNet would fire an alert at a very different level.
00:42:05
Speaker
Transforming that into, say, a ServiceNow ticket or a PagerDuty page or anything of that nature. We should have an understanding of when I would really page the on-call. When will I really create an S1 ticket?
00:42:17
Speaker
But at the same time, notifications should start going out in a way which makes sense for somebody to gauge whether I need to act now versus later. I think that's where I talk about incident management.
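A minimal sketch of the severity-to-action mapping discussed above, so that only genuinely critical alerts page someone while info and warning stay as notifications. This is an editorial illustration; the severities, actions, and alert shape are hypothetical.

from typing import Dict

ROUTING: Dict[str, str] = {
    "info": "dashboard_only",     # visible, but never wakes anyone up
    "warning": "notify_channel",  # post to the team channel for review
    "critical": "page_oncall",    # open an S1 and page the on-call
}

def route_alert(alert: Dict[str, str]) -> str:
    action = ROUTING.get(alert.get("severity", "info"), "notify_channel")
    if action == "page_oncall":
        print(f"Opening S1 ticket and paging on-call for: {alert['summary']}")
    elif action == "notify_channel":
        print(f"Posting to team channel: {alert['summary']}")
    else:
        print(f"Recorded on dashboard: {alert['summary']}")
    return action

route_alert({"severity": "critical", "summary": "checkout error rate above SLO"})
route_alert({"severity": "info", "summary": "nightly batch finished late"})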
00:42:27
Speaker
But you nailed it. That's exactly the example. So those are the sort of pointers to think about when designing a critical, scalable, reliable system.
00:42:38
Speaker
Now, I'll quickly talk about the three points to avoid. The first is over-scaling your observability. I mean, I don't know how you're going to take this line of mine; I'm talking as a guy who implements, but you are a guy who actually creates a product for that.
00:42:54
Speaker
So don't get me wrong, but here's the thing. I think we, the companies, try to absorb everything which comes out of the real-time monitoring solutions, like VueNet, Splunk, or any of those products, because you all have developed such a niche layer of capabilities.
00:43:10
Speaker
And we don't know what we really want to use in a rightful way. We talked about information overload also. So what I always suggest is: avoid over-engineering your observability at the initial stage.
00:43:23
Speaker
You need to think of an MVP type of layer, like, what metrics really make sense to me. So that's one suggestion I have. And scalability is important.
00:43:34
Speaker
Don't leave scalability till the last minute. That's probably the other point. And finally, don't miss out on communication and alignment with your stakeholders, specifically in the microservices world, specifically knowing that we have silos of product teams also working with business, where the final products ultimately come together as a marriage at the final stage of testing.
00:43:56
Speaker
So we as SREs, we as people who use products from VueNet or Splunk or any other products, the communication of what we are developing with the engineering teams, the product, and the business stakeholders is equally important. So my point is, don't avoid communication.
00:44:16
Speaker
It's okay if it's over-communication, but it would not be okay to go back and say, oh, we didn't consider your requirement because we didn't know about it. I'll stop there, Bharat.
00:44:28
Speaker
Yeah, very interesting. See, this whole technology game has been evolving, right? Look at the application architecture that has been evolving. The way we have gone from something within the data center, to hybrid, to completely cloud native, has also been changing.
00:44:47
Speaker
And in the same way, reliability also would have evolved, right? In today's time, how do you define reliability, and can you throw some light on managing reliability with the focus now on the MELT you talked about, right? That is metrics, events, logs, traces.
00:45:05
Speaker
How has reliability changed with this technology landscape changing? Good one. Let's start with this.
00:45:16
Speaker
How about I say that the first change I observe is that we have moved on from the world of monitoring to observability. I want to convince you of that statement. Here's where I'll expand.
00:45:28
Speaker
In the past, we were monitoring, which is: how is my CPU doing? How is my memory doing? We were doing this in, remember, our desktop days as well, like right-click, see the system metrics and the graph, and I immediately go and kill the applications.
00:45:44
Speaker
I mean, that was the monitoring part of it. Now we are in the observability part of it, where we consolidate things to give us one picture. So we take metrics, we take logs, we take traces, not to independently verify whether my CPU... Because in today's world, we don't. I mean, the reason I'm taking the stance that we are transforming monitoring into observability is because I don't go take action immediately if I see a CPU spike.
00:46:08
Speaker
Right? I don't do that. I then go and check the logs, right? Then I go check the traces. So you see that I am observing things right now. So that's the technology change. We have moved away from observability...
00:46:20
Speaker
sorry, we have moved away from monitoring; we have moved into observability. That's my first thing. Now let's slice that itself into the MLT components. There again, metrics have changed, logs have changed, and traces have changed.
00:46:35
Speaker
The first one, if I want to talk about the OTel agents: OpenTelemetry has made the world more connected. Like, I mean, the products, your products, or third-party products, it gives a very nice way of connecting to OpenTelemetry in a very seamless way.
00:46:54
Speaker
That was not the case in the past. It was a very specific, application-dropping-its-own-metrics type of world, where I had to capture that and ultimately analyze it. Now you can slice and dice what type of metrics you want to see, what type of metrics you want to integrate with VueNet, what type of metrics I want to integrate with Splunk, etc. So things from a metrics standpoint have changed in that angle.
00:47:17
Speaker
I wanted to touch on one more point with respect to, I think, the last point, which was about exporting. There was a time, long back, when we were struggling to send the metrics through a proper channel, which is the adaptability part of it.
00:47:34
Speaker
And we as engineers had to struggle to get the metrics to the right place, because you talk about the ports, you talk about getting more compliance, agreements, etc. But now, with OTLP as part of OpenTelemetry and whatnot, it's very, very seamless today. I mean, my team can establish an OpenTelemetry type of implementation for a bunch of services in a matter of hours, not days.
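A minimal sketch of the kind of quick OpenTelemetry setup described above: a Python service exporting traces to a collector over OTLP. It is an editorial illustration; the service name and collector endpoint are hypothetical, and error handling is omitted.

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "checkout-service"})  # hypothetical name
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def place_order(order_id: str) -> None:
    with tracer.start_as_current_span("place_order") as span:
        span.set_attribute("order.id", order_id)
        # ... business logic: payment, inventory, notification ...

place_order("98231")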
00:47:55
Speaker
That's how it has really evolved. So that's one of the differences from a metrics standpoint between the past and the current world we see. Logs, now, the logs have changed from...
00:48:09
Speaker
If I was an engineering team writing code that said A plus B equals C, if that was my program, I would have written logs which make sense for me to troubleshoot if there is a problem. That is gone in the world of today. Now logs are standardized.
00:48:25
Speaker
I mean, even if you don't standardize, tools like yours or many other tools standardize it. There is an ETL in logs. At one time, we were doing extraction, transformation, loading for business transactions.
00:48:40
Speaker
Nowadays, we are doing transformation on the log information as well. So that is how it has changed in today's world on the logging part of it. I'll move on. I know I keep taking too much time on every single part of it. No, no worries, Ravindra. Please take your time, because we definitely want to give everyone a much deeper view into how they need to do certain things, right? So yeah, this is pretty good input.
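A minimal sketch of the log standardization and transformation idea above: turning a free-form application log line into structured fields that any downstream tool can index. The log format and fields are hypothetical, purely for illustration.

import json
import re

LINE = '2025-01-15 01:03:22 ERROR payment-svc order=98231 msg="card declined"'

PATTERN = re.compile(
    r'(?P<ts>\S+ \S+) (?P<level>\w+) (?P<service>\S+) '
    r'order=(?P<order_id>\d+) msg="(?P<message>[^"]*)"'
)

def transform(line: str) -> dict:
    match = PATTERN.match(line)
    if not match:
        return {"raw": line, "parse_error": True}
    return match.groupdict()

print(json.dumps(transform(LINE), indent=2))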
00:49:06
Speaker
Please go ahead. Yes, wonderful. And finally, I'll end with tracing. So this is how we started off with our point: from the olden days to the new days, how things have been changing. My journey was to tell you that we moved away from the world of monitoring to observability.
00:49:20
Speaker
But observability means we are now observing the holistic picture of MLT as one view. And each of those has seen a transformation. So first I explained to you about the metrics, then the logging. And the last part, in my opinion, is the tracing part of it.
00:49:32
Speaker
So when we compare the olden days of legacy applications or monolithic applications, everything was tightly coupled. Everything was so together that, okay, if A, B, C was the path and something failed on A, tracing it through to C was not a difficult thing at all.
00:49:50
Speaker
It was all there. Whereas in today's world, it's microservices; it's distributed between teams and probably companies. So I connect to another company ultimately to get some executions done and whatnot, right?
00:50:03
Speaker
So tracing is not just about my team to the other team, do I have the tracing information? Now it's like, can I send it to another company which is serving the other part of the component for me?
00:50:15
Speaker
Can I get it back? And in my end-to-end dashboard, can I get a complete business transaction, in simple words? And that's exactly what has become possible because of distributed tracing. Products like yours enable what I can call a spy.
00:50:29
Speaker
We inject a spy; it goes on and multiplies through the traversal of the routes of the tree. And when you look back, I can actually establish the tree on a nice dashboard at any point in time.
00:50:40
Speaker
So for an L1 engineer at any point in time, if he gets a call saying your website is down, the days are gone where he has to call up the UI team to say that your UI is failing.
00:50:50
Speaker
And that is what leads to the MTTR increase, which is mean time to resolution, because every team has to analyze and bump it up to the next layer. Now the L1 guy, without much technical skill, as long as the right distributed tracing dashboard is in front of his eyes, he will see the red color out there showing which leg of the route is really causing the problem, and he can simply route the ticket.
00:51:13
Speaker
So if anybody challenges me, like, why are you making claims for distributed tracing, how are you going to magically reduce the MTTR? It's this simple example: the L1 can identify the right team, which means automatically my MTTR drops.
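A minimal sketch of how a trace is carried across service boundaries so that an end-to-end dashboard can show one tree, building on the OpenTelemetry setup sketched earlier. The downstream URL is a hypothetical placeholder.

import requests
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)

def call_inventory_service(order_id: str) -> int:
    with tracer.start_as_current_span("reserve_inventory") as span:
        span.set_attribute("order.id", order_id)
        headers: dict = {}
        inject(headers)  # adds W3C trace-context headers for the downstream service
        resp = requests.post(
            "https://inventory.internal.example/reserve",  # hypothetical endpoint
            json={"order_id": order_id},
            headers=headers,
            timeout=5,
        )
        span.set_attribute("http.status_code", resp.status_code)
        return resp.status_code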
00:51:26
Speaker
So that has evolved, which was not the case before; tracing was an unheard-of type of topic. Even, and this is a very controversial statement, unfortunately, even to this day, for many teams, for whatever reason, tracing becomes the last priority. I just don't understand why that's the case.
00:51:45
Speaker
If I had to prioritize as an SRE, for me, traces would come at the top. I understand why it is the case, because metrics and logs are so crucial for engineers to figure out what the problem is, and they want to solve it quickly.
00:52:00
Speaker
For them, it is M and L. But when I look at it holistically, end to end, because that's my responsibility as an SRE, tracing becomes extremely important because my intention is to reduce MTTR. Anyway, that is a thing apart. But this is how things have evolved and changed, and now we are constantly adapting as fancier products get into the market.
00:52:21
Speaker
Well, these are very interesting times also because a lot of standardization is happening due to OpenTelemetry. And one of the reasons we onboarded OpenTelemetry into our product was to provide a vendor-agnostic method to the customer: once you implement or install the OpenTelemetry agent, you would not really need to think about changing your application again if at all you want to use any other product, right? Any other observability platform.
00:52:56
Speaker
That was one. That's a very strong point, and I think it's important. I want to acknowledge your statement, Bharat. If you allow me, you're bang on, because organizations, you name it, for cost reasons, for technical reasons, etc.
00:53:13
Speaker
There comes a time when we might hop on to different products at any point. Providing this kind of decoupled relationship makes your product more interesting for companies to adopt. No doubt about it. Okay.
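A minimal sketch of the decoupling being discussed here: the application instruments once with OpenTelemetry and exports OTLP, so switching backends only means pointing the exporter elsewhere. The collector endpoint and service name below are placeholders, not any specific product's URL:

```python
# Backend-agnostic instrumentation: only the OTLP endpoint changes per vendor.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "mobile-banking-api"})  # hypothetical
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector.internal:4317"))
)
trace.set_tracer_provider(provider)

# Application code below this line never changes when the backend does.
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("startup-check"):
    pass
```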
00:53:26
Speaker
The second part we have also started seeing more recently is that people have started creating a data pipeline for their observability data. In fact, we have started recommending to customers: why don't you create some sort of data pipeline where whatever is written to it is available for various products, business logic, or reporting and OLTP sort of databases and so on to consume metrics.
00:53:58
Speaker
And with that, again, we are moving our customers towards some standardization within their boundaries, so that any new application, any new infrastructure, any new data center can be onboarded very quickly, because you have now standardized how you are going to monitor or observe your application and infrastructure.
00:54:20
Speaker
And that becomes very simple or easy for customers. Right, that reminds me, I think this is exactly what you're saying. If I want to put a word to that, in our terminology, I believe we call it observability as code. As long as I put it into a pipeline... I know your angle was...
00:54:40
Speaker
as long as we have a data pipeline, you're talking about a data pipeline, I get it, where people can subscribe to the data and information which we are collecting. But it also reminds me of what you said about what has changed in the world.
00:54:50
Speaker
That clicks for me as the point of observability as code. Now, we don't need to go in every single time a new functionality comes into my organization, or under our leadership, to redefine things. As their applications are released to production or to the next instances,
00:55:05
Speaker
since observability is baked into our pipeline, it is carried forward automatically onto that as well. That is how easy it has become to really implement the best practices of observability. Right. Although yours and mine were two different topics, your statement reminded me to talk about the pipeline.
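A small sketch of "observability as code" as described here, assuming alert rules live as data in each service's repository and a CI/CD step applies them on every release; apply_alert_rule() and the service name are hypothetical placeholders for whatever API your platform exposes:

```python
# Observability as code: baseline alert rules applied automatically in CI/CD.
ALERT_RULES = [
    {"metric": "http.server.error_rate", "threshold": 0.02, "severity": "critical"},
    {"metric": "http.server.latency.p95", "threshold": 1.5, "severity": "warning"},
]

def apply_alert_rule(service: str, rule: dict) -> None:
    # In a real pipeline this would call your observability platform's API or
    # render a config file the platform picks up; printed here for clarity.
    print(f"[{service}] ensure alert on {rule['metric']} "
          f"> {rule['threshold']} ({rule['severity']})")

def ci_step(service: str) -> None:
    # Runs on every release, so new services inherit the same baseline.
    for rule in ALERT_RULES:
        apply_alert_rule(service, rule)

if __name__ == "__main__":
    ci_step("upi-payments")  # hypothetical service name
```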
00:55:24
Speaker
So, as an adoption point. Yeah, sorry about that. No, no, that is perfectly fine. I think that is what this whole podcast is about, right? Points keep coming out and then... Yeah, seriously.
00:55:35
Speaker
Just one very important thing I wanted to ask your opinion on. When it comes to observability, do you agree that organizations should be moving towards tool consolidation?
00:55:50
Speaker
And if yes, if you believe tool consolidation should happen, how can a practitioner take steps towards achieving it? When I say tool consolidation, I mean that rather than somebody using five different tools, maybe use two, and consolidate the different types of telemetry being collected by different tools into the same tool, so that you get highly correlated, unified visibility.
00:56:26
Speaker
Bharat, I want to say that's exactly what's happening in the market right now, in my opinion. When I talk to my friends and colleagues in different companies, and from what I've seen at Nike and what I'm seeing at Kotak right now, consolidation is the truth.
00:56:41
Speaker
So, although the world is moving towards more microservices, siloed services that can be stitched together fast, the bedrock layers where you create a platform or the core services and so on are getting consolidated to serve the customers. Because the obvious reason, which we have talked about for the past 50 to 55 minutes, is how that division ultimately causes the problem.
00:57:10
Speaker
All the more reason you need unification, ultimately of the product or at least the technology, OpenTelemetry or those sorts of things, so that the handshakes can happen efficiently. End-to-end observability can come in, where you can see the bigger picture of reducing MTTI and MTTR, your SLIs and SLOs. If any company is serious about those,
00:57:34
Speaker
consolidation has to happen. What you can't really do, and I don't know, this is where you can educate me, is this: there's VuNet, there's Splunk, there's New Relic. Do you all have a bonding and a handshake between each other? Assume those three products are used in a company; can you come together and give one single end-to-end dashboard?
00:57:52
Speaker
Probably not, to a good extent, right? Because you're all operating on your own. So consolidation is not a question of whether it's happening or not; I can vouch that it's happening.
00:58:03
Speaker
I can watch for it. And I believe that is bound to happen further as well. If I answered your question, yeah. Yeah, that's very true. Just to give you a view from the VuNet perspective, we do integrate with some of these industry tools, right? We try to create a unified visibility.
00:58:23
Speaker
But the challenge we see in such an approach is that the data you get out of these tools is very limited. Yes. So you can get only so much by integrating.
00:58:38
Speaker
So that is where tool consolidation starts happening, because then you get a lot of raw data on which you can really do high correlation, contextualization and so on, and then bring it into a single unified console.
00:58:51
Speaker
Yeah, exactly. I don't know whether you planned to ask me this or not, but I want to be proactive here. Consolidation is one part of it. The other one is the consideration of a product, right? It is about what a company usually asks: do I need to go buy a product quickly?
00:59:12
Speaker
And if I'm buying, which product really fits me? That's a big topic of its own. People at a high level just think about the cost. Cost is important, no doubt about it, because there's no free money; no company wants to just waste money on it, right?
00:59:26
Speaker
So cost is part of it, but there are some important parameters which ideally any company thinks about. In simple terms, scalability and performance. There's VuNet, there are many others... you have competition too.
00:59:39
Speaker
Again, I can't speak for other leaders, but if you and I are talking about it, the simple conversations would be about scalability and performance. I don't need to explain that; it's as simple as everything we've talked about.
00:59:55
Speaker
Right. We talked about observability, where I keep making a sign like this, meaning MLT has to come together. So the most important thing is: are you bringing them together?
01:00:07
Speaker
I mean, how would you enable me to make the journey from monitoring to observability? So that is important. Then a few of the other things: ease of doing business, not in the sense of going and buying your product, but ease of doing business with your tool in my world, which is how quickly I can go plug it in.
01:00:29
Speaker
Now, are you telling me that I have different agents for the three of MLT, or are you giving me one agent? There is no good or bad answer there, because some companies provide one single library versus other companies giving multiple agents.
01:00:43
Speaker
But the problem is, with one agent, we have seen trouble on the scalability and performance side. With three of them, it becomes too much of a nuisance for us to constantly adopt. And then the alerting part of it.
01:00:58
Speaker
Now, teams adopt it in ways where mistakes can happen, because you need to understand that SRE is not a big fat team; there won't be hundreds of SREs. SRE is more of a consulting type of function.
01:01:11
Speaker
We lean on the engineers of every vertical so that they can learn from us. We put up the best practices and they go and implement, but no matter what I do, their way of adopting could be very different.
01:01:23
Speaker
We would have said: mark this as info, this as warning, and this as critical. They might not do it. Now the question becomes, how is the tool empowering us, with the AI part of it, or the intelligence in the technology, with something like, let's say, deduplication.
01:01:41
Speaker
Do you support deduplication? So it's about the niche features. People talk about the cost somewhere; cost is one part of it. But cost can always be outweighed by the premium of the services which ultimately come in as a package.
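One simple way a tool could do the deduplication mentioned above is to fingerprint alerts on the fields that identify the underlying condition and suppress repeats within a window; the field names and the five-minute window below are illustrative assumptions, not any product's actual behavior:

```python
# Alert deduplication sketch: collapse repeated alerts with the same fingerprint.
import hashlib
import time

_last_seen: dict[str, float] = {}
SUPPRESSION_WINDOW_S = 300  # illustrative 5-minute window

def fingerprint(alert: dict) -> str:
    key = f"{alert['host']}|{alert['check']}|{alert['severity']}"
    return hashlib.sha256(key.encode()).hexdigest()

def should_notify(alert: dict, now: float | None = None) -> bool:
    """Return True only for the first alert of a fingerprint inside the window."""
    now = time.time() if now is None else now
    fp = fingerprint(alert)
    last = _last_seen.get(fp)
    _last_seen[fp] = now
    return last is None or (now - last) > SUPPRESSION_WINDOW_S

# Example: the second identical alert within 5 minutes is suppressed.
a = {"host": "db-01", "check": "cpu_high", "severity": "warning"}
print(should_notify(a, now=1000.0))  # True  -> notify once
print(should_notify(a, now=1060.0))  # False -> duplicate, suppressed
```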
01:01:55
Speaker
I'll rest my case there. But I just wanted to expand on that. It's not about whether companies are thinking about consolidation; consolidation is a must. It's about whether I will build my own tooling or go and buy the product.
01:02:08
Speaker
Tool building is also like the open source route people talk about. It depends on the size of the company, the technical capabilities of the company and the energy of the company. So I'll stop there. I will just extend the last part you mentioned, which we have started realizing is very common with respect to what we have seen in customer environments, right?
01:02:33
Speaker
It is the culture, right? While you can go ahead and build maybe a very good toolset, which could be fully open source or a mix of open source plus commercial, right?
01:02:48
Speaker
Observability is as much about the culture you need to develop within the organization as it is about these tools and processes.
01:03:00
Speaker
How have you fostered a culture of transparency, accountability, and what I have started calling a culture of observability in your teams?
01:03:12
Speaker
So, it's been six months, right? I mean, although it's been eight months in the role that I am in right now, I've been hired to build up an SRE unit.
01:03:23
Speaker
It's been six months that I've been an active participant in this journey of SRE. So, no doubt, it has taken a lot of effort to instill the culture, based on what I've learned from my history.
01:03:35
Speaker
SRE is not production support. It's not that every team should have its own SRE. SRE is more of a horizontal platform that puts up a strong foundation, which can be implemented in some of the engineering teams, train them, and then come back and go to the other teams. That's my setup.
01:03:55
Speaker
Doing that in the past six months has been challenging, but it's also very rewarding. People have started appreciating SRE because of it.
01:04:06
Speaker
Because we have been able to show them the transparency part, which is: look, this is what you have, what has been either procured from a product like VuNet or any other product, or what has been built by us.
01:04:19
Speaker
So, you put it in front of the people, with the right documentation, what I call office hours. We have been doing our own marketing, because people need to realize that there's an organization called SRE, right?
01:04:32
Speaker
That has been my challenge, not doing the actual work. Remember, I said at the start, we are also great negotiators because we have to seed ourselves into the teams. So doing is just one part of it.
01:04:45
Speaker
Marketing ourselves, showing our value to the organization, has been the critical journey so far. I'll stop there. I hope I answered your question. Let me know if I didn't; I want to answer it precisely.
01:04:56
Speaker
No, I think you did part of it. But what I was looking at is, from your point of view as an SRE head, right? Yeah. What is the culture? Okay, let's not call it a culture of observability, maybe call it a culture of SRE practices.
01:05:13
Speaker
How do you make everybody aware, in your team as well as the teams around you, whom you directly or indirectly deal with, about the impact of the SRE process as well as the criticality of the SRE practice?
01:05:34
Speaker
Waiting for the right timing is one of the answers. I'll explain that. It's very important, because when you go and preach while everything is going well, nobody wants to listen to you.
01:05:45
Speaker
I need to pick the time to preach what I want to preach. And I pick the time when things have been extremely bad in the recent past, but not while things are burning. I wait. Immediately after, I go and present myself to the team which faced that critical issue and the heat from their leadership asking: why did we not have this? Why did I not get an alert? Why did we not have a dashboard?
01:06:06
Speaker
That's the bait for me. Yeah. So that's my style of operating. I had to go about it this way, because every vertical, every leader, everyone is so busy that they don't have the time to listen to my preaching at any other point, right?
01:06:20
Speaker
So you need to get your timing right, just like any sales department. I'm telling you, marketing, sales, because I am an outsider when you look at it from the engineering team's side. I am an outsider, no doubt about the word.
01:06:32
Speaker
Then again, I'm making it very dramatic. But in simple terms, my point is, you need to find the right opportunity and timing so that your preachings and your whole set of principles, the Bible you've created for yourself, get delivered at the right time. That is all my point.
01:06:49
Speaker
But at the same time, I have respect for all the engineers and the leaders and architects in the firm. I want to appreciate the fact that SRE is becoming a wave. I'm telling you, that's the reason they have hired an SRE team also, right? There isn't much resistance.
01:07:05
Speaker
There was resistance in the past. I believe it. I quote one of the leaders who said: I have my monitoring established, we have our alerts. Everybody thinks at the silo level of "I have a problem, I have an alert coming in," but not the broader spectrum of the horizontal connecting the systems.
01:07:21
Speaker
So that is the only selling you need to do, but maturity is changing, no doubt about it. The answer to your question is that the culture and the selling happen at the right time. Now comes the point of my own team. How do they operate? How do we train them? Because not everybody comes from an SRE practice in the past. They have been DevOps engineers; they have been doing releases and things like that.
01:07:41
Speaker
They have done a pretty good amount of tool establishment, be it open source or otherwise. But then the question becomes, how do we put them into the bucket of the strategy which I am seeing through? And for that, too, timing is key.
01:07:56
Speaker
Constant one-on-ones. Constant team meetings. Taking one or two examples of bad things which were turned into good outcomes because of your strategies. So yes, you need to put in the effort. It's not that easy.
01:08:09
Speaker
Because in today's world, it's not about commanding, it's about requesting, and within the request you put your expectations. I think that's the art of it. But between the two, Bharat, getting consensus on SRE from the outer world, and building our own team with the SRE mindset, I believe convincing the outer world is a little more difficult than establishing your own SRE team, because the latter is somewhat in your control to make happen.
01:08:37
Speaker
I'll stop here. I hope I addressed your question properly. Yes, yes. Ravindra, I don't know whether you have done this before or not, but if you had to look at a commercial observability platform like ours, right? It's a hypothetical question, and if you don't want to answer, that is okay. But I just wanted, for people's reference: let's say somebody who is leading an SRE practice in some enterprise is asked to look for a commercial observability platform. What would be your top three to five considerations or reference checks as the prospective buyer?
01:09:18
Speaker
I would not shy away from that question at all. I want to answer this, and it's probably useful for a lot of other SRE leaders who are trying to evaluate a product, or who are trying to make a product better. So here's the thing, and I say every one of the next few statements with high regard and high respect.
01:09:36
Speaker
First and foremost, because remember, I keep saying, no matter what the technical side is, technically you know what you want, you're working on it. As a customer, I would say cost effectiveness and transparency will outright beat everything else; that is my opinion. And I want to expand on it a little further.
01:09:57
Speaker
Transparency is, and this is not only about SRE products, right, it's the nature of sales: there is a glorified view at the time of sales.
01:10:10
Speaker
At the time of execution, sales are gone. The people who really sold the product are gone. Now it's about production support care, and ultimately things start to go downhill. I'm sure you understand where I'm going with this. That's why I amplify the point about cost efficiency and transparency.
01:10:27
Speaker
So the transparency I'm talking about is transparency about what is really shippable to us, transparency about how it has really worked with other customers. It's not about making things up.
01:10:38
Speaker
Not that anybody's lying. It's that they've implemented, say, a part of A, but my journey and my ask was A, B, C. During the sale it would be, oh, we have A, but then you realize that B and C have been missed out. And unfortunately, even we can miss it, because these things happen in a very time-bound environment.
01:11:01
Speaker
Either we are in a big rush or we are always very busy, so even we can go wrong with our evaluations and such. So that's what I mean by transparency. Now, there is another meaning of transparency; let's talk about the licensing part of it.
01:11:14
Speaker
When we're looking to adopt... I feel products are making the licensing concept extremely complex. For the service-level metrics, this is the slab.
01:11:29
Speaker
This is the count, this is the tier. If you go on to the next layer of networking, this is the extra license, and so on. I just wish it could be put in very layman wording for us to understand what gets included and what doesn't when I get one license of something.
01:11:46
Speaker
So that's what I mean by transparency. I'll stop here; I don't want to make it a very dramatic list. But transparency, my hope is, means being very transparent about the features which are really there against the problems the customer is putting in front of you.
01:11:59
Speaker
And the second type of transparency, and the cost-effectiveness, is: can it be simplified with respect to the license? At the end of the day, it's important for you because that is how you generate your revenue. But it's also very important for me, because getting the metrics, logs and traces based on that license is what I really wish for.
01:12:17
Speaker
It should not happen that I buy certain licenses only to realize after three months that I need to come and pay more because this license does not cover that. I'm sure you understood. I'll stop there.
01:12:28
Speaker
So that's most important. I understand, I understand. Now, coming to today's topic of generative AI, I'm sure you would have come across multiple projects. Maybe you also got involved in certain projects at Nike, or are trying to do so at Kotak now.
01:12:49
Speaker
Can you share your thoughts on how Gen AI can help in any of those SRE pillars, including toil reduction, chaos engineering or observability?
01:13:00
Speaker
I know, it is a good question. Yeah, it's very simple, I think. I'll take it. I'm probably not that smart to think of complicated things. By now you'd have realized that I'm more straightforward in my answers than I'm supposed to be.
01:13:14
Speaker
But at the same time, even here, I don't think of fancy things when it comes to engineering. In simple terms, to use a word, if you hadn't used it I would have, which is toil detection. I feel the important role it plays in the world of SRE: one of the pillars is toil.
01:13:32
Speaker
It also happens to be my passion. It has a huge contribution to productivity; that's what the world has been saying. In today's world we are overspending on operations. This is not to eliminate people; it is to make things much faster and better. So Gen AI really fits in there. But I want to take a moment and look into the AI component and generative AI itself, not just as a text assistant or anything like that, but for its output in the observability stack.
01:14:03
Speaker
We were, and I think you have heard it from Suresh as well, in a reactive world in the past, where teams would go and act when a customer comes and tells you that your website is down.
01:14:16
Speaker
You have matured very well into a proactive nature: I've set my dashboards, I've set my alerts. And the world is now transitioning into predictive. That's exactly where AI comes in.
01:14:27
Speaker
Again, taking the example of Nike, I know during the period of Thanksgiving I make around 15 to 20% more than my typical sales, typical revenue. Even more during that Black Friday span, right?
01:14:40
Speaker
I know, based on the data I hold from two or three years back, that we had spikes like this, metrics that flowed in like this, very negative sentiments that came in. All of that information is there, because AI is all about the data.
01:14:54
Speaker
Now, the predictive part should be: guys, when your traffic increases from one customer to, say, ten customers, when you go ten times over, you should anticipate this alert to be fired.
01:15:06
Speaker
And there will be a predictive alert which fires: tomorrow at 10 a.m., I think you will be in trouble. Because your system has an understanding that, oh, these guys were operating at, say, three nodes.
01:15:20
Speaker
Now they're bumped up to ten nodes. And guess what, this data matches last year, where they did the same thing, and last year they had these problems. You see where I'm going with this: the observability tool has a huge opportunity to use AI and move to a predictive type of alerting, rather than only proactive alerting when the issue happens.
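A minimal sketch of the seasonal comparison Ravindra describes, assuming peak traffic is projected from prior years' data for the same period and checked against current capacity before the event; all numbers, the growth factor and the capacity figure are made up for illustration:

```python
# Predictive alerting sketch: compare a projected seasonal peak with capacity.
history = {
    2022: [52_000, 58_000, 61_000],  # hypothetical requests/min, Thanksgiving week
    2023: [60_000, 66_000, 71_000],
}
CAPACITY_RPM_PER_NODE = 8_000  # hypothetical per-node capacity
current_nodes = 10

def predicted_peak(history: dict[int, list[int]], growth: float = 1.18) -> float:
    """Project next peak as last season's peak times an assumed growth factor."""
    last_year = max(history)
    return max(history[last_year]) * growth

peak = predicted_peak(history)
headroom = current_nodes * CAPACITY_RPM_PER_NODE
if peak > 0.8 * headroom:
    # Fires days before the event, not when things are already burning.
    print(f"Predictive alert: expected peak ~{peak:,.0f} rpm vs "
          f"capacity {headroom:,} rpm; scale out before the event.")
```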
01:15:42
Speaker
So I think those are my two answers, and I will stop here. Yeah, very, very interesting. The way we have been looking at Gen AI is also on the observability platform side: how do we help some of these SRE teams, or any other monitoring teams which are using the platform, become more efficient?
01:16:08
Speaker
Yes. We cannot just replace anything and everything, but how can you reduce your MTTI or MTTR by making them more efficient and providing them the right information? We have basically started calling it Gen AI based recommendability. So when a problem happens, can I go out and recommend, okay, these are the five things which you need to look at now, right?
01:16:33
Speaker
And then they can quickly run through a checklist: okay, the tool is recommending three things, quickly look at those three things and reduce your MTTR.
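A sketch of what such Gen-AI-based recommendability could look like: correlated incident evidence is summarized into a prompt and a model is asked for a short, ranked checklist. The llm_complete function, the service name and the symptoms are hypothetical placeholders, not a specific vendor's API:

```python
# Gen-AI recommendation sketch: turn incident context into a ranked checklist.
def llm_complete(prompt: str) -> str:
    # Placeholder: wire this to whichever LLM provider/client you actually use.
    raise NotImplementedError("connect to your LLM provider of choice")

def recommend_next_steps(incident: dict) -> str:
    prompt = (
        "You are assisting an SRE during a live incident.\n"
        f"Service: {incident['service']}\n"
        f"Symptom: {incident['symptom']}\n"
        f"Suspected components: {', '.join(incident['suspects'])}\n"
        "List the top 3 checks to perform first, one line each, "
        "most likely root cause first."
    )
    return llm_complete(prompt)

incident = {
    "service": "fund-transfer-api",             # hypothetical
    "symptom": "error rate 12% for 10 minutes",  # hypothetical
    "suspects": ["payments-db deadlocks", "middleware thread pool exhaustion"],
}
# checklist = recommend_next_steps(incident)  # returns a short checklist to act on
```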
01:16:44
Speaker
That is very interesting. Very interesting, I didn't really think of that. That's very cool. So when we started experimenting with AI, we started with event correlation, coming up with high-fidelity events.
01:17:01
Speaker
Again, they could be our events or events from other products, right, if there is enough information available in those events. Then we also did some anomaly detection where you do not need static thresholds.
01:17:16
Speaker
Over time, it figures out the bounds the metrics stay within. And then if you find that the metric is actually breaching any of the bounds, either upper or lower, you go ahead and throw an alert.
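A minimal sketch of the dynamic-threshold idea just described: learn a rolling band (mean plus or minus k standard deviations) from recent samples and alert only when a value leaves that band. The window size, k, and the latency numbers are illustrative choices, not the platform's actual model:

```python
# Dynamic threshold sketch: alert on deviation from a learned rolling band.
from collections import deque
from statistics import mean, pstdev

class DynamicThreshold:
    def __init__(self, window: int = 60, k: float = 3.0):
        self.samples = deque(maxlen=window)  # rolling history of recent values
        self.k = k

    def check(self, value: float) -> bool:
        """Return True if `value` breaches the learned upper or lower bound."""
        if len(self.samples) >= 10:  # need minimal history before judging
            mu, sigma = mean(self.samples), pstdev(self.samples)
            breach = abs(value - mu) > self.k * sigma
        else:
            breach = False
        self.samples.append(value)
        return breach

detector = DynamicThreshold()
for latency_ms in [110, 118, 112, 115, 109, 111, 117, 113, 116, 114, 420]:
    if detector.check(latency_ms):
        print(f"anomaly: latency {latency_ms} ms outside learned band")
```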

AI in Root Cause Analysis

01:17:30
Speaker
So that becomes a sort of dynamic threshold-based alerting. The last place where we use AI is for a sort of probable RCA. Now, we are receiving a lot of data from almost every infrastructure component and application: logs, metrics, traces.
01:17:47
Speaker
Can we identify which of the components are behaving incorrectly? Right? And once you identify that probable root cause, then we provide this recommendability using Gen AI.
01:17:59
Speaker
Very interesting. Bharat, this is cool. I want to ask you a question. I'm sorry, you're the one who's supposed to be asking me, but let me ask you one. The probable RCA, when you use the word probable, are you talking about an indicative suggestion saying this could be the cause for the issue which happened in the past?
01:18:24
Speaker
Or are you doing it more in a futuristic sense, as in, hey, we see these anomalies and we expect something to go wrong? Which of the two is it? I just want to clarify. Both are possible.
01:18:34
Speaker
The current one we have delivered works like this: let's say an issue is happening now, on an almost real-time basis. Your application has started seeing a lot of failures, for example.
01:18:45
Speaker
Right. And then, because we are monitoring the complete application delivery infrastructure, which could be the network, servers, databases, storage, middleware, application logs, application traces, and so on, the whole thing uses our AI models to identify, out of all these components, and it could be hundreds of them, which component and which golden signals are causing a problem right now and could have caused these failures to go up.
01:19:20
Speaker
Fantastic. When I say probable, it's because it doesn't know exactly that a given problem is leading to this, right? That is why we call it probable, and there can be one, two or three candidates, depending upon how the whole infrastructure is behaving at that point. Yeah.
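A rough sketch of the probable-RCA ranking described above, assuming each monitored component is scored by how many of its golden signals currently look anomalous and the top candidates are surfaced; the components, signal flags and scoring rule are made up for illustration:

```python
# Probable RCA sketch: rank components by anomalous golden signals.
GOLDEN_SIGNALS = ("latency", "errors", "saturation", "traffic")

# 1.0 means that signal is currently flagged anomalous for that component.
snapshot = {
    "payments-db":  {"latency": 1.0, "errors": 1.0, "saturation": 1.0, "traffic": 0.0},
    "api-gateway":  {"latency": 1.0, "errors": 0.0, "saturation": 0.0, "traffic": 0.0},
    "object-store": {"latency": 0.0, "errors": 0.0, "saturation": 0.0, "traffic": 0.0},
}

def probable_root_causes(snapshot: dict, top_n: int = 3) -> list[tuple[str, float]]:
    scores = {
        component: sum(signals[s] for s in GOLDEN_SIGNALS) / len(GOLDEN_SIGNALS)
        for component, signals in snapshot.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(c, s) for c, s in ranked[:top_n] if s > 0]

print(probable_root_causes(snapshot))
# -> [('payments-db', 0.75), ('api-gateway', 0.25)]  # probable, not certain
```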
01:19:39
Speaker
Okay. Very interesting. Let's take an example. Let's say it's a DB-related problem; the DB has deadlocks, for example, right?
01:19:50
Speaker
Then once you click on it, it gives you the recommendation, which could come from your ITSM solutions, based on how you fixed this problem earlier. It can tell you, okay, go to the database, figure out which users or which query is currently in that state.
01:20:08
Speaker
And then it will say, okay, go ahead and maybe stop that query, restart that session, and so on. Right. That is the recommendability part, which basically is generated using Gen AI. Very nice. Okay.
01:20:20
Speaker
Okay. I would like to read through it in detail for sure. Definitely, I'll reach out to you. So we are basically doing a couple of pilots now. The first part, the probable RCA part.
01:20:31
Speaker
Recommendability is still sort of in beta, still being worked upon. It's a work in progress. Yeah. Okay, fair enough. Thank you so much, Ravindra. I have just one last question. This is something we have started asking all of our guests because we want to learn from your experience, right?

Inspirations and Influences

01:20:51
Speaker
So, I wanted to ask you, what are your favorite books? Which is the book you keep going back to? Or which is the book you generally give people as a gift?
01:21:05
Speaker
And what would you recommend to people? Yeah. Okay, I have three different types of books based on the three different questions you have: what is my interest, what do I keep going back to, and what do I gift?
01:21:18
Speaker
So let me answer all three. As far as my interest is considered, Bharat, I'm very much into biographies and autobiographies. I'm not into fiction, and not much into other non-fiction either.
01:21:28
Speaker
I'm fascinated and blown away by people's brilliance and what humans have done for this world, in a good way. I mean, bad things have also happened, but still, there's the human power in there. And I need that a lot of the time, because I think it's very easy for me to get tired and demotivated, and these remind me of it. So my niche is biographies, no doubt. And with that, the book I love is mostly the controversial one, which is Elon Musk. There are only two types of people with Elon Musk.
01:22:02
Speaker
Either they love him or they hate him. So for me, with Elon Musk's books, there are around four to five different authors who have written about him at various stages of his life. The latest one I've read is Walter Isaacson's, I guess, from 2023.
01:22:19
Speaker
As a matter of fact, there is one which has been released now in 2025, which I've not read yet. So that's my taste. What I gift usually comes from another set of readings I do, about self-improvement. For me, one great book is on habits, the other book being Atomic Habits.
01:22:39
Speaker
It's something I usually gift to youngsters, not people who are already well into the world of earning money. I do that with youngsters because I feel they are the ones who can still be molded; the later part is done and set, and I don't think habits can be changed so easily.
01:22:57
Speaker
So that's one I gift. On the gifting part, I come from Nike, no doubt about it. So my most common gift to anybody is Shoe Dog, because Shoe Dog is by Phil Knight, the founder of Nike.
01:23:10
Speaker
That actually is in the top ten of the world's motivational reading. So that's the most common book which I gift to everybody. That's a long answer to your question.
01:23:24
Speaker
No, no. Thank you so much, Ravindra, for being so candid and so insightful. I'm sure a lot of people who listen to this will get so much out of it if they are following the SRE path or have anything to do with SRE, observability, or the reliability of applications and so on. I'm sure they will find it very, very insightful.
01:23:47
Speaker
Thank you so much for your time, Ravindra. Just for a quick moment, the pleasure is probably all mine, honestly, because I met you first in 2024 or 2023, I can't really remember.
01:24:00
Speaker
But as I say, at times the first meeting leaves a very strong impression, and you are one of those. I loved the way you and Suresh

Final Thoughts and Reflections

01:24:10
Speaker
were talking. And then we had lunch together. I remember the conversation we had during those 30 minutes of lunch.
01:24:16
Speaker
So I have very high regard for you and the way you have grown your own firm and all these things. I respect that, and it's absolutely been a great time for me to spend a few minutes with you. So thanks a lot for having me.
01:24:29
Speaker
Thank you so much. Hope you found my discussion with Ravindra insightful. If you did, please consider sharing it with your colleagues. For more information about VuNet Systems, please visit us at www.unitsystems.com.
01:24:46
Speaker
Thank you.