The Cry Wolf Problem in Monitoring
00:00:00
Speaker
So monitoring is like, you know, that bank I told you about that says we'll wait for a complaint. Monitoring is largely alert-based. Right. And most alerts, and I've seen this in a lot of companies, many people have thousands of alerts.
00:00:13
Speaker
They configure very broad parameters. And so you'll see lots of red and yellow, red and yellow. And in that forest of reds, the actual reds are all invisible. You know, a cry-wolf kind of story.
00:00:27
Speaker
So many people miss the alert, and then things go badly wrong, and then, you know, a complaint happens. So then somebody reacts, even though they've had full monitoring, you know, a fair amount of expensive monitoring. Second thing
Siloed Monitoring Teams and Communication Gaps
00:00:38
Speaker
is monitoring as a culture. This I'm telling you as a generalization, but I've seen it many times over, so I suspect it's a valid generalization.
00:00:47
Speaker
Most companies have monitoring teams that only look at some very narrow things. So you'll see the monitoring team say the system is green while the end customer says the system is very bad.
00:01:00
Speaker
Because they say, according to these three things, my system is green. Firstly, the monitoring teams and the application development teams never meet each other.
00:01:11
Speaker
Yes. It's a very common cultural problem that the monitoring team looks at the alerts, has its own review within its own world, and then sends only a digested report, a PowerPoint, to the rest of the world, I mean the rest of the company, not the outside world.
00:01:26
Speaker
The application team only sees the digested report, which is largely good news. So, for instance, if an API is failing, the monitoring team knows, but the application team doesn't know. So it never improves that API. Correct. This is a very common problem. And I'm surprised at how common it is, because you would have thought it was obvious, but it's not, because organizations are created in these silos.
00:01:52
Speaker
Right. The silos tend to work within themselves. It is not in their interest or in their structure to go and talk to somebody else.
Introduction to Sankarson Banerjee and Observability Culture
00:02:01
Speaker
So, you know, one of the whole cultural shifts from monitoring to observability: what does observability mean? Who is observing?
00:02:09
Speaker
It is useless to have a security guard observe that, you know, robberies are happening in the next house, right? It is the house owner who's worried. The house owner doesn't have any visibility.
00:02:21
Speaker
So how can he observe anything? He has no access to the dashboard. He has no access to the raw data.
00:02:38
Speaker
Hi, welcome to a new episode of Observability Talk. This episode is quite different. For years, I have had the pleasure of engaging with ideas around technology and observability with our guest, Sankarson Banerjee.
00:02:54
Speaker
Sankarson, or Shanky as we like to call him, has been an IT leader with a lifetime of experience. He has been the CIO of RBL Bank, the NSE, and the IIFL Group.
00:03:06
Speaker
He has also worked extensively in technology consulting, with stints at Accenture, Mphasis, and IBM. Over the next few episodes, Shanky and I will look at observability through various lenses.
00:03:20
Speaker
We plan to bring more guests who might contradict us, debate with us. Our objective is to go deeper and today we would like to explore how organizations can build an observability culture.
00:03:34
Speaker
Where should they start and how should they go about doing it? We will also discuss practical tips, real-life experiences, and much more.
00:03:46
Speaker
Shanky, thank you so much for joining us today.
Building an Observability Culture with Sankarson Banerjee
00:03:49
Speaker
It's a real pleasure talking to you about something which is very close to both of our hearts. One thing I definitely want you to start with: can you define observability in your own words, and why you believe it is essential for some of these enterprises you have worked with, or for the new digital-first businesses which are coming up?
E-commerce Insights and Observability Evolution
00:04:13
Speaker
See, you know, I started with e-commerce a long time ago, in the first wave when, you know, IndiaWorld became famous and everybody started doing e-commerce. In those days, we didn't have any concept called observability.
00:04:29
Speaker
But we quickly realized that as soon as you do a digital transaction, a lot of things are visible to you. You know, you get a lot more data about the individual bits of the transaction, about where the person clicked and where he hovered and all that.
00:04:41
Speaker
And so, generally across the world, e-commerce players started tracking this quite obsessively: what happens in the whole thing, in the front end, in the middle, in the back end.
00:04:53
Speaker
Today, it is fashionable to call it observability. In those days, we used to just think, we have to look at what the users do. And, you know, there was this problem that we couldn't physically see users. So data was, therefore, a very important criterion to decide what to do, how to go about it, who is doing what, funnel drop-off, all of that.
00:05:12
Speaker
Now, when enterprises started adopting web technologies, they actually kind of left this part out. You know, I helped many enterprises over a period of time, and most of my enterprise career, in the BFSI world and so on, has been on the web side of things, not on the traditional client-server side of things.
00:05:34
Speaker
So on the web side, when we started building, we built the interactivity and, you know, the other advantages of the web, interconnectivity, ease of access. But this obsessive ability to view each individual action, we kind of left out.
00:05:52
Speaker
Primarily because enterprises were not used to doing that. Enterprises were used to looking at the beginning and the end. What happens in the middle wasn't really considered worth knowing. For instance, banks never thought of funnel drop-off as anything useful.
00:06:06
Speaker
He's going to make a payment, he's going to make a payment. If he drops off in the middle and starts again, I don't care. When the digital world started maturing, we realized that people actually did care. Banks did care that a payment was not made here, because if the system failed or the payment failed, you would go somewhere else and make it.
00:06:25
Speaker
So banks would start losing money. And then fintechs came in, which had much better observability and used this data to do things. See, observability by itself is useless. It's actually using the data to improve your system that really matters.
00:06:39
Speaker
And so suddenly the older banks, the PSUs of the world or the older established banks started seeing a drop-off in their users, especially their younger users.
00:06:50
Speaker
People would refuse to use them because the newer, better fintechs and banks would use the data to improve their systems. And therefore, everybody would have a better experience.
00:07:02
Speaker
Whereas the older banks, you know, I remember working with this big public sector bank, and their idea of observability was, somebody will complain, then I will look at it. But that's not observability. Observability means, why do you need to wait for a complaint? The system is already telling you he's in distress.
00:07:21
Speaker
And digital allows you to get down to an individual person's individual transaction, as opposed to the old world where, you know, there was no real way to see an individual transaction. So this whole business of going from, once he complains, then I will take a look at it, to looking at the data to see if anybody is in distress, that is really the journey of observability.
00:07:45
Speaker
Correct. Correct. So, because you come from the very old school of technology, right, where companies or enterprises had these monolithic services, maybe a couple of servers, and the number of transactions hitting the servers was very small, to something where we're basically talking about a million transactions a day, or maybe a billion transactions a day, right?
00:08:16
Speaker
The whole landscape also has evolved. Can you talk a little bit about your experience of this evolution? You talked about that e-commerce time, right, when it was easy for you, or you basically baked into the technology how people are using the application and so on, right?
00:08:37
Speaker
How has this technology evolved, from something which was fully monolithic to something which is now native microservices or serverless?
00:08:52
Speaker
So, you know, people deride monolithic, but I have a different view. Monolithic by itself is not bad. There are many... You know, even The Matrix had the concept of the whole mainframe somewhere in the middle, right? The movie, The Matrix. Right.
00:09:06
Speaker
So it's not that monolithic per se is bad. In fact, monolithic architectures where they work are a lot simpler than distributed architectures or microservices architectures. The challenge with monolithic has always been explosive growth.
00:09:17
Speaker
Because you have one single big thing, you can't suddenly make it double the size or treble the size, right? But NSE, for instance, was largely a monolithic architecture. It was not a microservices architecture. One, because it did very few things.
00:09:31
Speaker
And secondly, at that volume, to give very high reliability, monolithic is actually easier. Though now it is, of course, eventually hitting the limits of monolithic and moving away. But NSE, at least when I was there, was largely monolithic.
00:09:46
Speaker
And this was by design, because, you know, it's a simpler architecture. It was not entirely monolithic, not one single mainframe, but it was six servers doing a billion transactions a day.
00:09:58
Speaker
So to that extent,
Monolithic to Microservices: Scalability and Efficiency
00:10:00
Speaker
we tried distributing it, and with distribution came so many other complications. What really pushed this towards distributed, towards microservices... By the way, microservices is a logical way of looking at things.
00:10:15
Speaker
You know, just because you have 50 services doesn't mean you sit on 50 servers. It can all sit on one single server. Yes, yes. And in fact, it quite often does. So this push towards distributed systems architecture came along when you started realizing that, one,
00:10:30
Speaker
vertical scalability has limits. It becomes very expensive beyond a point. So you can build a supercomputer that does everything, but then a supercomputer is very expensive.
00:10:41
Speaker
On the other hand, if you string together a lot of cheap computers, you can get a lot of stuff done, especially if you tackle the downsides. So that's what the cloud is about. The cloud is not about big computers.
00:10:54
Speaker
Whoever is buying a 64-CPU machine in the cloud is basically making Amazon rich. This is how Jeff Bezos funds his space trips. Whoever is buying tiny computers on the cloud, stringing them up into a highly resilient network, and therefore using that large compute power collectively, is really making his life easier, because it's exponential growth. The cost of a tiny machine is nearly nothing.
00:11:19
Speaker
And therefore, the cost of a lot of tiny machines is also nearly nothing. Right. But you have to make sure that you understand how to manage a distributed system which has its own fairly significant complexities. Right.
00:11:29
Speaker
That scalability push came from e-commerce, because e-commerce was very spiky. They were growing very fast. You know, the difference between, let's say, Flipkart's slowest day and largest day is about a thousandfold.
00:11:42
Speaker
NSE doesn't have that kind of spike. If it grew a thousandfold, NSE would seriously have to contend with the limits of vertical scalability. Correct. But NSE grows a thousandfold over a period of time, not nearly instantly the way e-commerce does.
00:11:59
Speaker
E-commerce is by nature spiky. Because sometimes people will come and shop, and at other times they'll go and sleep. And therefore, it is not predictable. And therefore, the cloud business, or the Netflix kind of thing, you don't know how many people will see a movie.
00:12:14
Speaker
You don't know how many. I built the back end of Hotstar. Well, I didn't build it, I was just managing the team. I had very bright people below me who built it. But the idea was, Hotstar would not be able to predict how many people want to see a cricket match. And of course, since I've left, Hotstar has grown crazy.
00:12:32
Speaker
When we were there, we were thinking in terms of hundreds of thousands. Now, I think 50 million is their last number, the highest ever. So, again, that kind of spikiness is very hard to deal with in on-premise setups, in fixed setups.
00:12:50
Speaker
So, therefore, it becomes very sensible to have this rental model with thousands of servers. Small servers mean smaller increments. If you have very large servers, you can put 50 large servers together too.
00:13:03
Speaker
But then each increment becomes very large. And the other thing is, again, this is where observability matters. You know, when we first did Hotstar, we didn't have all this observability business.
00:13:14
Speaker
We couldn't really figure out what was going on. It was very early days, right? So we just kept extra servers lying around. Just in case, you know, things went bad, we'd manually attach the servers, manually add capacity.
00:13:27
Speaker
So we had it all ready and configured and all that. And then, of course, we forgot to switch it off. So the peak happened, the World Cup happened. Then after a while, TV is telling us, you know, we are getting a much larger bill than we thought we would get. What happened?
00:13:41
Speaker
Then we realized we had taken all these extra servers, put them in, and forgotten to switch them off. Because nobody was looking at the usage of the servers, and all that data wasn't really easily available in those days.
00:13:51
Speaker
Correct. And so, of course, then we went and said, oh, we have to shut this down. So, observability gives you this ability to scale at a very micro level.
00:14:04
Speaker
You know when you're about to run out of capacity, and therefore you can quickly add capacity. Otherwise, the ability to add capacity is useless. You don't know when to do it. You can do it manually, but then you will be very inexact.
00:14:18
Speaker
So if you do it automatically using data that the system is telling you, the system is telling you your user growth is climbing, then you do it at that time, instead of two hours earlier or at the beginning of the day.
00:14:31
Speaker
And similarly, when the system is telling you things are not doing very much, then you de-grow. And all these adjustments can be made in real time using fairly simple scripting. No AI needed, actually. Just very straightforward. AI helps, of course, in improving things. But, you know, even well before AI, we were doing all this using simple scripting, using observability triggers: at 80% utilization, add more capacity.
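A minimal sketch of the kind of threshold-triggered scaling loop being described here, assuming a hypothetical metrics source and capacity API; the function names and the 80%/30% thresholds are illustrative, not from the episode.

```python
# Illustrative sketch of the simple, pre-AI scaling loop described above.
# get_avg_utilization() and set_instance_count() are hypothetical stand-ins
# for whatever metrics source and capacity API an organization actually uses.
import time

SCALE_UP_THRESHOLD = 0.80    # add capacity when average utilization crosses 80%
SCALE_DOWN_THRESHOLD = 0.30  # shed capacity when the system is mostly idle
MIN_INSTANCES, MAX_INSTANCES = 2, 50

def autoscale_loop(get_avg_utilization, get_instance_count, set_instance_count):
    while True:
        utilization = get_avg_utilization()   # observability trigger: one metric
        count = get_instance_count()
        if utilization > SCALE_UP_THRESHOLD and count < MAX_INSTANCES:
            set_instance_count(count + 1)     # grow in small, cheap increments
        elif utilization < SCALE_DOWN_THRESHOLD and count > MIN_INSTANCES:
            set_instance_count(count - 1)     # de-grow when demand drops
        time.sleep(60)                        # re-evaluate every minute
```

The point is not the loop itself but that the trigger comes from observed data rather than from a human watching a dashboard.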
00:14:56
Speaker
So those rules or alerts can be attached to a simple script that brings up new services. Where microservices fit in is, you know, now we have something better. If you architecturally construct things using individual, independent floating services,
00:15:15
Speaker
then you're able to observe each independent service separately. As opposed to saying the CPU is occupied, you can say too many people are logging in, which is a much narrower and much simpler thing.
00:15:28
Speaker
Earlier, we used to say, you know, we don't know what's running inside. We know that the box itself is taking up a lot of CPU. Today we say, no, no, too many people are logging in, so give them more machines. Not that many people are using the shopping cart.
00:15:40
Speaker
So you need to give them fewer resources. So the balancing becomes a lot finer across individual services, as opposed to just a very coarse upsizing of the machine itself.
00:15:55
Speaker
Very true. So, see, microservices also brought in a reduction in code complexity, where you are reducing the code for a service into a smaller chunk and making sure that the service does only one thing and does that one thing very well.
00:16:16
Speaker
That's where the architecture part comes in. You know, not everybody does microservices this way. Correct, correct. So that is a big challenge, actually, where the service is being called and you do not even know what the service is being called for.
00:16:31
Speaker
Even when the workflow says that this service should not come into the picture. There are a lot of cases we have seen in customer environments when we show them the service map: why is this service getting called for a fund transfer? This is not supposed to be called at all.
00:16:45
Speaker
But somebody has written something. You have to follow a certain amount of discipline to encapsulate the service entirely in its own context, which many don't follow. There is a lot of implied tight coupling. So it's a difficult job to do a very good quality microservices architecture.
00:17:04
Speaker
It's not a trivial job. Right. So one of the things, Shanky, that started happening when this whole evolution happened is that a lot of people started claiming that the earlier monitoring tools, like NPM, APM, server monitoring, may not be directly applicable, because now we are talking about Kubernetes, microservices, and so on.
00:17:26
Speaker
Right. Because the services are coming up and going down, like you said, that service itself may not be there when you are trying to monitor things some time later.
00:17:37
Speaker
Right. For instance, at RBL, during my last stint, we used Finacle and then we moved to a containerized version of Finacle.
00:17:49
Speaker
Now, Finacle used to come with AppDynamics, but it did not come with container observability. It came with server observability. And you know, it's not that the data is missing. The data is available, but the models are missing.
00:18:04
Speaker
How to take a look at it, how to visualize it on the screen, that was all built for servers. And so we actually had a pretty hard time initially in figuring out whether a service was working or not, because we couldn't see whether that particular container was working or not. We could see the forest of containers, yes.
00:18:20
Speaker
You have 20 containers at 50% utilization. But you can't really see what is happening within a container, what kind of... If a container fails, you don't see it. You see it only when a server fails.
00:18:31
Speaker
Right, right. So this is where this whole observability thing started coming up for us, right? People say that a monolithic type of application you can monitor using an APM, while for a microservices-based application or cloud-native application, you cannot really use APM directly.
00:18:51
Speaker
Because the services can come up and go down very quickly. And it's very difficult to have an APM agent installed and then monitor them. And that's when people started moving towards observability, where they bring tracing, logs, and everything into it.
Transitioning to Observability Culture from Traditional Monitoring
00:19:07
Speaker
Right. Now, one challenge which we are seeing in most customer environments is that people have been doing monitoring.
00:19:14
Speaker
Now we are basically talking about observability. How have you seen this transition happening, from something very reactive to a newer technology which can make you more proactive? So, monitoring is like, you know, I told you, that bank that says we'll wait for a complaint. Monitoring is largely alert-based.
00:19:32
Speaker
Right. And most alerts, you know, I've seen this in a lot of companies, many people have thousands of alerts. They configure very broad parameters.
00:19:43
Speaker
And so you'll see lots of red and yellow, red and yellow. You have a forest of reds. The actual reds are all invisible. You know, a cry-wolf kind of story. So many people miss the alert, and then things go badly wrong, and then a complaint happens. So then somebody reacts, even though they've had full monitoring, you know, a fair amount of expensive monitoring.
00:20:03
Speaker
The second thing is monitoring as a culture. This I'm telling you as a generalization, but I've seen it many times over, so I suspect it's a valid generalization. Most companies have monitoring teams that only look at some very narrow things.
00:20:18
Speaker
So you'll see the monitoring team say the system is green and the customer says the system is very bad. Because they say, according to these three things, my system is green.
00:20:30
Speaker
Firstly, the monitoring teams and the application development teams never meet each other. Yes. It's a very common cultural problem that the monitoring team looks at the alerts,
00:20:42
Speaker
has its own review within its own world, and then sends only a digested report, a PowerPoint, to the rest of the world, I mean the rest of the company, not the outside world. The application team only sees the digested report, which is largely good news.
00:20:57
Speaker
So, for instance, if an API is failing, the monitoring team knows, but the application team doesn't know. So it never improves that API. Correct. This is a very common problem. And I'm surprised at how common it is, because you would have thought it was obvious, but it's not, because organizations are created in these silos.
00:21:18
Speaker
Right. The silos tend to work within themselves. It is not in their interest or in their structure to go and talk to somebody else. So, you know, one of the whole cultural shifts from monitoring to observability: what does observability mean? Who is observing?
00:21:35
Speaker
It is useless to have a security guard observe that, you know, robberies are happening in the next house, right? It is the house owner who's worried. The house owner doesn't have any visibility.
00:21:47
Speaker
So how can he observe anything? He has no access to the dashboard. He has no access to the raw data. Right. And so, a big part of this cultural shift towards observability, again, the fintechs or the e-commerce guys do it much better.
00:22:01
Speaker
The e-commerce guys, the sales guys, are worried about how many sales are happening. You know, Amazon reputedly measures their turnover per pixel, rupees per pixel.
00:22:13
Speaker
They have that screen. Just like a retail store observes, you know, retail stores are very obsessive about data. Right. Physical retail is far more obsessive about data than, let's say, banks are. Physical stores really measure sales per square foot.
00:22:27
Speaker
Sales per square foot, footfalls per square foot, profits per square foot. And they're obsessive about it. I mean, there's a lot of data that everybody analyzes all the time. So, you know, all the early BI use cases, the analytics use cases, came out of Walmart and, you know, the retail stores, Target.
00:22:45
Speaker
They contracted the IBMs of the world to produce all these models of data. They have cameras following people around, things like that. They do that tick, tick, tick counting when you're walking in. All that is data that is input into actual decision-making.
00:22:59
Speaker
Right. E-commerce similarly inherited that culture of I want to see everything. Now everything is digital, and suddenly they realized, you know, it's a lot richer than the physical world, because you can see in far more detail what happened.
00:23:13
Speaker
Right. So analytics does that. But, you know, many companies are not used to this culture of using analytics to see things. Now, of course, this is on the business side.
00:23:25
Speaker
On the application development side, analytics also tells you what to build, what not to build, how to build, how to improve. You know, at IIFL, for instance, we realized, this is a long time ago, of course, that 60% of our customer queries were password resets on Mondays.
00:23:42
Speaker
Because SEBI has a very strong password reset rule. Every 15 days you must reset your password. So people would reset it, forget it, come to the office on Monday, and realize they had forgotten their password. So they would call complaints asking for a password reset.
00:23:53
Speaker
So once we did self-service reset, 60% of our calls went away.
00:23:57
Speaker
But again, you know, in those days it was a small company. So the teams used to sit next to each other and we would interact a lot. But in many other companies, the call center and the complaints team don't even sit in the same room or the same building as the application development team, or sometimes even the same city.
00:24:15
Speaker
And they are not structured to cooperate with each other. There is no organizational structure that allows them to interact. So therefore monitoring happens, not observability. Right, right.
00:24:27
Speaker
So, one of the things you mentioned has actually even been turned into a theory, the watermelon theory people call it, right? Everything from the outside will be green, but as soon as you get in, everything is red, like what you talked about early on, right?
00:24:43
Speaker
Because on average, it may well be green. So, you know, again, if you look at any monitoring screen, if you take a time period that is long enough, everything will look green. Or if you take a time period that is short enough.
00:24:55
Speaker
Okay. But if you dig into the details, you'll see lots and lots of reds everywhere. It's very rare that an organization, unless it has a good culture of observability and not just monitoring, is all green.
00:25:09
Speaker
But actually it should be, you know, when was the last time your Toyota car failed even in small things?
00:25:15
Speaker
No, I think the way you are talking about the cultural part of this whole observability, and just correct me if I am wrong in understanding what you're trying to say, one part is knowing what is happening with the application,
00:25:31
Speaker
and then connecting it back to the developers to basically get a view of what is failing and how it is failing. And then they are able to push newer updates and so on. And then again, it's sort of an iterative process.
00:25:44
Speaker
But if you had to talk about the culture of observability in any of your previous roles, how do you set it up? I mean, how do people go about it? I've been doing a lot of that. Now that I don't do all my compliance meetings and my strategy meetings, I do consulting work on this only. This is my favorite passion topic.
00:26:04
Speaker
You know, how to improve systems, especially digital systems. Why is an Amazon so much more reliable than an RBL Bank? Right. Or, for that matter, my mother's blog is more reliable than, you know, the commercial websites that many enterprises produce.
00:26:18
Speaker
So this has been a big passion of mine. You see, I realize there are a few steps to this. First, and this is what I do at the beginning of every one of my consulting projects, you must force people to come to the table and observe.
00:26:32
Speaker
Application guys or infrastructure guys tend to say, looking at the AppDynamics data or whatever monitoring data is the monitoring team's problem. Right.
00:26:43
Speaker
But actually, it's not their problem. The monitoring team is not producing anything. Who is producing? The application guys. So if the application guys don't know what's going wrong, what is the point of that data?
00:26:56
Speaker
It's like if the sales guys don't know what is selling, then what is the point of the data? Some security guard knows what is selling, but that's not useful. So I start putting up these weekly reviews, saying you must meet once a week.
00:27:09
Speaker
Firstly, everybody in that whole team, all the five or six verticals that are involved in this application, must have access to the data, which is surprisingly not common. People tend to restrict access, even though there's no security implication here. It's all read-only data. It's just that people are used to siloing data.
00:27:30
Speaker
So the first thing you do, and it'll take an average organization two or three months, is to get all these permissions in place for everybody to be able to see the data. And observability data is entirely passive. It is about history. So there's no security implication.
00:27:44
Speaker
Correct. In the sense that you can't fool around. You can't make a mistake and alter the data or something. Nothing like that. Right. So it's usually pretty easy to get the exemption, but it takes some time. So that is one.
00:27:55
Speaker
Create that layer of permissions so that all the stakeholders are able to see the same data. Second, once a week, 15 minutes, I need people to see the data.
00:28:07
Speaker
Not the presentation, not the spreadsheet. I want people to see the actual screen, so you can drill down and see, how red is it? Which is, again, surprisingly uncommon. You know, in one of the meetings I went to, there was a senior vice president.
00:28:20
Speaker
So I asked him, can you see the raw data? The actual screen, AppDynamics or whichever tool, I don't remember which one. Okay. The senior vice president said, junior vice president, call him. The junior pointed to somebody else. It turns out some guy in Bangalore has access. Nobody in the whole room of some 8 to 12 people had access.
00:28:38
Speaker
Because, you know, they think of it as a monitoring problem.
00:28:42
Speaker
So it's like, you know, if I attach a heart monitor to myself, I am the patient, but only whoever I bought the monitor from is seeing the data. So, once a week, 15 minutes, you need to see the data and use that data to arrive at decisions. Eventually, you must get to all green.
00:29:04
Speaker
There is no excuse for an application that is not all green. It's just poor quality if it is not green. So even if there are small errors, remove them. Take your time removing them. You don't have to drop everything.
00:29:17
Speaker
But you should not have them. You know, why should you have any error in your system? And you never know. Actually, small errors are the ones that snowball. You know, when growth happens or multiplication happens, it is the small errors that blow up in your face.
00:29:33
Speaker
Things that you thought were no big deal. And so, eventually you must get to an all-green setup. And that is the first step of observability: the people who are going to make the change need to observe the data.
00:29:47
Speaker
So that is just the error part. But also, you know, you can use the data to improve things. Right. If some feature is not being used, either run a campaign or drop the feature. Right. Things like that.
00:29:58
Speaker
So, as part of that per-square-foot kind of thinking, you need to start tying actual business outcomes to all your decisions. What does the data tell you? Right, right. And I came to the banking world from e-commerce. I didn't come from banking.
00:30:14
Speaker
So when I first came to RBL, RBL was my first banking CIO job, I used to say, here's the app, tell me how many people click on it. And they would say, no, we don't know.
00:30:25
Speaker
Right, the data itself basically was not there. So that is another thing which I think this exercise would do, right? Once you are in a room and everybody's there, people may ask, okay, can I see this data, right?
00:30:39
Speaker
Maybe it is a failure. Can I see the failure response, and which response codes the transactions are failing with? It's a simple question. And then which line, what failed, which API failed, all that.
00:30:52
Speaker
So initially they start doing that in the meeting. But eventually it starts becoming a culture because they can now see the data even outside the meeting, right? They can see it 24 by 7. So they start doing this outside the meeting.
00:31:04
Speaker
And once the errors go away, they start using the data to say, okay, now this feature, why is it not being used? Is it a useless feature or is it a feature that's useful, but people can't figure it out?
00:31:15
Speaker
Right. You know, we once built this e-commerce site. We made it look like Amazon. It was, you know, quite successful in its own time. And we had this problem that nobody would go beyond the first few items; whatever was shown on the first page would sell very well, but nobody would go much beyond that.
00:31:33
Speaker
So we started looking at the data. Eventually, we started looking at hover data using Hotjar, which is still around. It's a famous tool. It tells you where people are positioning the mouse pointer. And we realized, you know, we had this tree with plus signs, you know, the standard Microsoft Windows kind of tree.
00:31:51
Speaker
And nobody was actually hovering over the plus signs. So we started thinking, why not? I mean, it should be obvious that there is an expansion there. And then we went and asked people. They said, no, we thought the plus sign was decorative.
00:32:05
Speaker
Okay. So the data told us that that is the question we should ask. Without the data, we would never have asked this question. Correct. Correct. So that kind of thing is what happens eventually. When the error problems start going away, then they start thinking, okay, now the error is not there, but I'm still looking at the data. What else can I see? Right.
00:32:26
Speaker
Then it becomes true observability. As I said, I have spent a lot of time in retail and e-commerce, where people genuinely obsess about data. They're like, go and look at the category data. Did he just move it, or did he put it in his cart?
00:32:40
Speaker
Did he put it in his cart? Did he put it back on the shelf? What did he do? Right. So that's not about errors anymore. That's about starting to use that data to improve your product, to maximize your product.
00:32:53
Speaker
Correct. That also becomes a sort of user experience thing: what the user is able to do, how fast you are able to give those responses, and without failures.
00:33:03
Speaker
You know, AI has become very good. So there are tools that tell you about furious users. They have not complained, but they've actually been pretty upset, or irritated users, or, you know, merely...
00:33:18
Speaker
Whatever. Unappreciative users. They can tell you this: these users seem to be very furious. So you go and examine what they did and why they are frustrated. Sorry, not furious. Frustrated users.
00:33:30
Speaker
Okay. Go and figure out why they are frustrated. Right. Right. And so that really helps you improve the experience. Because you obviously start with the frustrated users. You don't start with, you know, mildly unhappy users.
00:33:43
Speaker
Right. Right. And as part of these meetings, where you basically ask everybody to come, do you also review whether people have configured the right alerts, whether the thresholds are right?
00:33:56
Speaker
Are too many alerts coming in? Are the right alerts getting generated? Is there discussion around that? Yeah. So three major discussions happen. One is, once you start seeing the red alert, people will say, oh, yeah, we just ignore it.
00:34:10
Speaker
So the rule is, if you are just going to ignore it, it is not an alert. If it is a red alert, it means it's a red alert. It means you're supposed to, you know, rush and fix things. If you're not planning to rush and fix things, why are you classifying it as a red alert?
00:34:22
Speaker
So that is the big thing that happens. Second, one group will say, no, no, I keep facing issues. And the other group will say, but look, there's no alert, everything is green. So then something is misconfigured. You have not observed a variable that you should have, one that could have caught the error.
00:34:38
Speaker
Correct. So this is all on the failure side. This is what happens in the first six months or so. People focus on these failures to improve the monitoring itself.
00:34:49
Speaker
Right. You know, should you monitor, for instance, this variable instead of that? Correct. And how much weight should you give that variable? Is it critical to your business, or is it just so-so?
00:35:00
Speaker
And people eventually start thinking in financial terms, especially if you're on the cloud. You know, all of these have a financial side; you can directly tie it to a cost. So, you know, take banks: a transaction fails in a bank.
00:35:15
Speaker
We can actually figure out what the cost of that is. What is the cost of that failure? Right. How much revenue did I lose? Did I have to pay a fine or a penalty? And therefore, I start thinking in terms of not just failures in the abstract, but failures in terms of the impact to the business.
00:35:33
Speaker
So if there is a failure that has no impact on the business, I'm happy to ignore it, or even remove the variable. If a small failure has a very high business impact, then I'm happy to increase its weight and look at it more closely.
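A rough sketch of the weighting idea being described: scoring failures by estimated business cost rather than by raw count. The transaction types and rupee figures are made-up placeholders, not figures from the conversation.

```python
# Sketch: rank failures by estimated business impact, not by raw error count.
# The cost figures per transaction type are illustrative placeholders.
COST_PER_FAILURE = {
    "fund_transfer": 150.0,   # lost fee revenue plus potential penalty, in rupees
    "balance_check": 0.5,     # negligible direct cost
}

def business_impact(failures_by_type):
    """failures_by_type: dict of transaction type -> failure count in the window."""
    impact = {
        txn: count * COST_PER_FAILURE.get(txn, 1.0)
        for txn, count in failures_by_type.items()
    }
    # Highest-impact failures first: these deserve the red alerts and the weight.
    return sorted(impact.items(), key=lambda kv: kv[1], reverse=True)

# Example: a dozen failed fund transfers outranks thousands of failed balance checks.
print(business_impact({"fund_transfer": 12, "balance_check": 4000}))
```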
00:35:45
Speaker
Right, right. And when you are having these discussions on the cultural part of it, is there any specific thing which you do on the development side also? Obviously, one part is that the developers need to fix the problems which are coming up.
00:36:03
Speaker
But is there something else? Some time back we were discussing shift left, for example, right? Right. Where we are saying that when you start developing, think about how you will observe it, or what sort of things you need to make sure it is more observable.
00:36:18
Speaker
Right. So are you seeing some of these cultural aspects coming into effect? Finally, it is the developers that have to observe. You know, anybody else observing, an infrastructure guy or the monitoring team observing, is not really useful.
00:36:31
Speaker
I mean, they can do temporary fixes. Okay, you know, like the emergency room: I'll give you a shock and restart your heart. But finally, you need proper medical care, which means developers have to improve the system. So the observability journey really ends with developers observing, and therefore using that observation to continue to improve their systems.
00:36:49
Speaker
Fixing an issue happens only when the errors are large. But when they start seeing the data again and again, they actually start thinking, this is how the system behaves, this is how the system is actually used, as opposed to how the system was designed.
00:37:04
Speaker
And that is how, if you think of car design, for instance, it's heavily dependent on how people use a car. You know, the original cars were very uncomfortable, because they were designed by people who thought cars would be used in a particular way.
00:37:18
Speaker
But as they started getting usage information, they started realizing many of these things could be put in, cars could be made much better. So car engine technology doesn't progress that quickly, but cars have become a lot nicer since, you know, the 1970s. Right, right.
00:37:34
Speaker
Okay. So similarly, the journey must end with application developers. As in, the application developers must start observing the system, to say this is how it's used, therefore I'll build it like this. Right, right.
00:37:46
Speaker
And two things are key to this. One is build less. Most people overbuild. So observability tells you whether it's useful to build something or not. So you should not, for instance, make 50 API calls to figure out everything about your customer and then do this and that.
00:38:04
Speaker
It turns out nobody goes to those branches which require those 20 API calls. So then why have the 20 API calls and that branch at all? That, observability will tell you: nobody goes to that branch. It is not useful. It is not generating any revenue.
00:38:17
Speaker
And therefore, I start forcing application guys to model what the cost of each standard transaction is. Right. And when these developers are looking at the shift-left part, or the embedded observability part and so on, one of the things which you and I had discussed earlier is the logs part. We do get how people are using the app, where they are clicking, what sort of drop-offs are happening, how many transactions are coming in, how many are successful and all.
00:38:53
Speaker
But these are still metrics data, right? Whereas logs, as we had discussed a long time back, provide a lot of contextual data: what this transaction is about, what is really happening within the application.
00:39:06
Speaker
Can you talk a little bit about what your thought process is? Actually, until fairly recently, enterprises worried only about instrumentation in the sense of usage data, hardware data, SNMP, network data, which is all data about the underlying system and not about the transaction itself.
00:39:30
Speaker
The transaction produced transaction logs, which would get stored primarily for forensic purposes. They would not actually be analyzed. So the first kind I call ops data, you know, the data that is required to operate the machine.
00:39:47
Speaker
The second is business data. The transaction is actually what the machine is supposed to do, right? So the logs of a transaction are the data for the business.
00:40:01
Speaker
Right. But technology plays a big role in making that data visible for the business. Or others, like InfoSec, for instance, make a lot of use of it. So, you know, again, in the e-commerce world, we realized web server logs are a goldmine.
00:40:19
Speaker
So there were whole companies that came up to analyze web server logs. Before Google Analytics, there was another company that Adobe bought, which we, even as a small e-commerce company, used to subscribe to at a fair expense, because that would give us so much of an idea of how users behaved, which has nothing to do with failures.
00:40:35
Speaker
Like, it didn't tell us anything about failures. But it would tell us, of course, where things stopped. And then we could go and figure out, oh, this page is not loading, that's why things stopped. But by and large, it would not tell us whether something had failed or not.
00:40:47
Speaker
It would tell us what the path was. Like this hover data I told you about, that is Hotjar. You know, you would hover on a particular point, and you can make out hover patterns too, an undecided pattern, or an unconcerned one where the mouse is just sitting there and nothing is happening.
00:41:06
Speaker
And you could do a lot of analytics from that, to see if your product is good or not, if your transaction is going through or not. So again, Amazon was the pioneer in this. Amazon said, once you've decided to buy, I will have the lowest friction in getting you to pay.
00:41:23
Speaker
Pay, correct. It ends at payment. Okay. So, they would obsessively observe this funnel. The funnel is all from log data, I mean, all from transaction logs. It has absolutely nothing to do with infrastructure.
00:41:36
Speaker
Right. And we would obsessively observe the funnel to see where the drop-off happened, what percentage happened where, you know, and whether this was normal or exceptional, and so on and so forth. In e-commerce, the funnel used to be our life. Now, of course, there's a lot more.
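A small sketch of the funnel computation being described, assuming each parsed log record carries a session ID and the funnel step it reached; the step names and field layout are illustrative assumptions.

```python
# Sketch: compute per-step funnel drop-off from transaction logs.
# Assumes each log record is a (session_id, step) pair; names are illustrative.
from collections import defaultdict

FUNNEL = ["browse", "add_to_cart", "checkout", "payment", "confirmation"]

def funnel_dropoff(log_records):
    sessions_per_step = defaultdict(set)
    for session_id, step in log_records:
        if step in FUNNEL:
            sessions_per_step[step].add(session_id)
    report, previous = [], None
    for step in FUNNEL:
        reached = len(sessions_per_step[step])
        drop = None if previous in (None, 0) else 1 - reached / previous
        report.append((step, reached, drop))  # (step, sessions reaching it, drop vs previous)
        previous = reached
    return report
```

Nothing here touches infrastructure metrics; the whole picture comes out of the transaction logs, which is the point being made.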
00:41:52
Speaker
Correct. Correct. And then, where did people come from? You know, what is your marketing campaign to sales ratio? It's not something even directly observed, but you have the logs at different times: I sent out a marketing campaign at this time, and these are the purchase logs that happened.
00:42:09
Speaker
Correct. At NSE, for instance, we used to use log data to do fraud analytics, circular trading, what Ketan Parekh was caught for. Things like that all come from the transaction logs.
00:42:20
Speaker
They don't come from infrastructure. Right, right. And in the API world, the logs have become much richer, because now each API call has a payload, it has a context, it has all kinds of things.
00:42:33
Speaker
So in the API world, as a transaction zips through multiple APIs, you get a lot of information about what happens. Right. And the stitching can also be done, because at each of the places you have the transaction ID and whatever information you need.
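The stitching works roughly like this: group log lines from every service by the shared transaction ID and order them by time. A minimal sketch, assuming each parsed record carries txn_id, service, and timestamp fields; those field names are assumptions for illustration.

```python
# Sketch: stitch one business transaction across multiple API/service logs
# using the shared transaction ID; record fields are illustrative assumptions.
from collections import defaultdict

def stitch_by_transaction(log_records):
    """log_records: iterable of dicts like
    {"txn_id": "T123", "service": "payments-api", "timestamp": 1700000000.0, "msg": "..."}"""
    journeys = defaultdict(list)
    for record in log_records:
        journeys[record["txn_id"]].append(record)
    # Order each journey by time so the hop-by-hop path through services is visible.
    for txn_id in journeys:
        journeys[txn_id].sort(key=lambda r: r["timestamp"])
    return journeys
```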
00:42:49
Speaker
And in this, of course, capital markets are really far ahead. Capital markets have been very, very advanced at taking transaction logs and using them. So almost all algorithmic traders, especially high-speed traders, rely on transaction logs to decide what to do next.
00:43:05
Speaker
They rely on the timestamps within a transaction, what came back, what the order book position is, all that stuff. Because all of that comes back in the transaction, and they use it to figure out how to take the next step, all in microseconds.
00:43:21
Speaker
So capital markets have always been ahead on this. And similarly, NSE tries to figure out whether you're trying to defraud, front-run, this, that, all through the logs. Right, right.
00:43:33
Speaker
In all your consulting assignments, where you have been working with customers on this observability side, or on the health and performance of systems, we obviously talked about a couple of pitfalls, but what are the most common pitfalls you have seen, the things people are not doing right?
00:43:52
Speaker
So, by far the most common is that the alerts are set wrong. And therefore, people just ignore them. This alert is not really an alert. And the typical problem is tens of thousands of alerts.
00:44:04
Speaker
It's not a small problem. In fact, in any company, it's one of those low-hanging fruits. I get paid a fairly large amount of money to say something very obvious. I'll go to a company and say, your alerts are wrong.
00:44:16
Speaker
And invariably it will be true. There will be lots and lots of alerts which are, you know, completely useless, in the sense that it's the default configuration, like 70% CPU utilization. Somebody has put an alert there. So what will you do with that alert?
00:44:29
Speaker
No, we ignore it, actually.
00:44:34
Speaker
So even in the companies I work with, we've had to spend a lot of time and effort removing those alerts, so that you get to a clean setup.
00:44:46
Speaker
So that when something is actually an alert, it really means something. If it's a red alert, then you need to worry about it immediately. You can't say, it's not really red.
00:45:00
Speaker
If it's not really red, you shouldn't be having an alert. Correct. In fact, I was talking to somebody, and they did this exact exercise. I can't remember the company name, but what he was saying is that they did this exercise for three months.
00:45:13
Speaker
Any time an alert came and nobody reacted to it within the next five minutes, they would disable that alert. Correct. And that's how it should be. If it is a red alert and nobody is reacting to it, you know, it's a false alert.
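A sketch of that pruning exercise as it might be automated: over a trial window, flag for disabling any alert definition that fired but never got a response within five minutes. The record fields (alert_id, fired_at, acked_at) are assumptions for illustration, not a real tool's schema.

```python
# Sketch: find alert definitions that repeatedly fire with no human response
# within five minutes, so they can be disabled after a trial period.
RESPONSE_WINDOW_SECONDS = 5 * 60

def alerts_to_disable(alert_events):
    """alert_events: list of dicts {"alert_id": str, "fired_at": float, "acked_at": float or None}."""
    ignored, acted_on = set(), set()
    for event in alert_events:
        responded = (
            event["acked_at"] is not None
            and event["acked_at"] - event["fired_at"] <= RESPONSE_WINDOW_SECONDS
        )
        (acted_on if responded else ignored).add(event["alert_id"])
    # Only disable alerts that were never responded to during the trial window.
    return ignored - acted_on
```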
00:45:27
Speaker
Correct. And everybody knows it's a false alert. So, therefore, you can disable it without too much concern. The second biggest thing is this siloed culture, where only the monitoring team sees the raw data, everybody else sees digested data.
00:45:39
Speaker
And the digested data is invariably good news. Yeah, averaged out and showing more green than red. And I think there are some political issues here also. Do you think so? Yes and no. I mean, obviously, the guy who's built the monitoring team, under whom the monitoring reports, which can mean hundreds of people, he's reluctant to give it up.
00:46:01
Speaker
He feels that if everybody watches the data, what will happen to the monitoring team? And actually, there should not be a monitoring team. Just like in application development, there is this theory that you should have a strong QA team. But if you look at the journey of application development, or of cars, Toyota has no QA team.
00:46:23
Speaker
Because if a car is defective at the end of the process, then your process is wrong. It's not the car that's wrong. So Toyota really doesn't have a QA team. It has QA at every stage. Like, they have this famous thing, you know, that anybody can stop the entire assembly line if one screw is loose. Any worker can.
00:46:43
Speaker
That is the culture. That is not the culture in India. You know, a worker stopping the assembly line is probably thinking, not only will I lose my job, my entire family will be thrown out into the streets.
00:46:54
Speaker
Right, right. That culture has to come: that quality is important at every stage. Similarly, this culture of full green has to come, that there should not be any red.
00:47:07
Speaker
If there is a red, I must react to it and remove it as quickly as possible. And it cannot happen, therefore, only at the monitoring team. Increasingly, in fact, AI makes it so that most of this can be automated away.
00:47:18
Speaker
So again, NSE is a good example. You know, we have thousands of alerts a day, but it's almost all automated. Temporarily, something will become yellow, and everybody looks at it. And then it will go away, because there's some automatic remediation.
00:47:34
Speaker
NSE, of course, is further ahead on this curve, and it has spent a lot of money fixing it. Correct. Because at NSE, you know, one error blows up to such a level that the CTO has to lose his salary.
00:47:46
Speaker
So, therefore, NSE is very careful. If you think of it, NSE nowadays does about 15-odd billion transactions a day. So having even 10 errors a year is an incredibly tiny rate.
00:47:59
Speaker
And that's why it has put a lot of time and effort into building these automated fixes. But the same technology is available everywhere. For instance, Amazon doesn't have any monitoring team, doesn't have any, you know, team of specialists sitting around looking at the system and fixing it as it goes along. Right? It's almost all automatic.
00:48:16
Speaker
If it wasn't, they couldn't manage so many servers. Yeah, yeah. So the same technology is available to everyone, but that is the next culture shift: the monitoring team has to give up its existence.
00:48:29
Speaker
And that, politically speaking, can be quite tricky. Because the guy who heads the monitoring team can potentially be quite politically powerful. Correct. Correct. So, just extending this to the other side, I have heard a lot of people talking about NoOps, right? Basically, no ops team is required.
00:48:51
Speaker
In fact, when we were talking to Dilip Asbe of NPCI, this was one of the things he mentioned to us: boss, we should look at it so that there is nobody glued to a dashboard or a TV, right,
00:49:03
Speaker
watching what is happening in our systems. There should not be any ops team at all. It should be: you generate an alert, you figure out this is a critical alert, this is the problem, go ahead and fix it, and then things continue.
00:49:18
Speaker
So I am hoping, with some of this AI, Gen AI... Yeah. So NoOps is actually a combination of three technologies that exist, along with some culture shifts. One is that NoOps is created by the application developer.
00:49:30
Speaker
The developer can create a system that doesn't require operations. Auto-grow, for instance, is a good example. Now, who should do auto-grow? If you ask a typical traditional company, they'll say infrastructure will decide whether to grow, auto-grow, or manually grow. But why should that be the case?
00:49:47
Speaker
The application developer can build a system so that it takes advantage of auto-grow. Correct. Because auto-grow is not instant. There's a certain delay. So you can set up the system so that the triggering happens well in time, and so on. It's very hard for the infrastructure guy to know all this.
00:50:02
Speaker
Right. So, again, in the dot-com or the startup world, there is no infrastructure person at all. You know, nobody is hired for infrastructure in many of these startups, including the startups that I have run.
00:50:16
Speaker
And so I've started discouraging people from hiring infrastructure guys. You know, at RBL, we used to have about 40% of our stuff on the cloud and 60% on-premise.
00:50:28
Speaker
The cloud part had four people, all provisioning experts, and that was it. We had insisted everything be automated. Right. The physical part had some 200 people. And there again, a lot of automation started happening, because the tools are available.
00:50:44
Speaker
And again, there was political resistance. There was a lot of pushback saying, no, no, let's not automate. They'll give you other reasons. They won't say, I'll lose my job. They'll say, no, no, it's unreliable, it's expensive, it's this, that, it's insecure.
00:50:57
Speaker
Yeah. All of which, I mean, you can work through, right? What is insecure about an automation? You can always make it secure. So eventually, if you look at established companies like the Amazons of the world, they have 10,000 servers to one person, because most of it is automated.
00:51:15
Speaker
Almost everything is automated. That one person is only for when things actually blow up or catch fire.
00:51:21
Speaker
So many of these, even car factories, you know, there are many car factories in China that don't have any lights. They're run by robots. So whoever comes into the car factory uses a torch; everything is automated. Tesla is a good example. Everything is automated.
00:51:37
Speaker
Right. So there's no ops, and it's automated. Of course, the actual robot is moving, but the fact is the visibility is all through software. And the whole point of observability is you must do something with your observation.
00:51:54
Speaker
Otherwise, you're wasting money. Correct. So far, a lot of folks who have come from the monitoring side to observability are still using observability as a monitoring sort of thing, right? Improved monitoring, in the sense that it's still humans observing.
00:52:10
Speaker
Yes. But to take that data and write a script... As I said, NSE always scripts. I mean, they're just simple scripts saying, if this trigger happens, do this. Correct.
00:52:21
Speaker
But that culture is still some distance away, right? That's the NoOps culture. Correct. The other challenge with the NoOps culture is a lack of knowledge.
00:52:34
Speaker
And it's moving very fast. So actually, you know, if I were to write Perl scripts today, people would laugh at me. At least the better IT guys would laugh at me, because today there are much better ways to write automation scripts.
00:52:46
Speaker
Correct. Correct. So if, let's say, you had to suggest to some IT leader, okay, they have put up an observability culture and they want to assess it, the maturity of that observability culture.
00:53:02
Speaker
What are the metrics they should look at? I mean, should they do some sort of a survey, or what? I haven't really thought about it, but of course they should probably start by seeing how many alerts they have that nobody responds to.
00:53:16
Speaker
That is an easy measure to get, right? How many alerts are there on a day-to-day basis? Because if there are lots of alerts on a day-to-day basis and the business is not collapsing, then the alerts are useless. You know, if, let's say, I was a patient and my heart monitor kept going off but I didn't die, then eventually I would just ignore my heart monitor.
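A minimal sketch of that first maturity metric, assuming alerts can be exported as records with an acknowledged/actioned flag; the field names are illustrative rather than taken from any specific tool.

```python
# Minimal sketch: measure how many alerts nobody responds to.
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    severity: str
    acknowledged: bool

def unactioned_ratio(alerts: list[Alert]) -> float:
    """Fraction of alerts that were never acknowledged or acted upon."""
    if not alerts:
        return 0.0
    ignored = sum(1 for a in alerts if not a.acknowledged)
    return ignored / len(alerts)

if __name__ == "__main__":
    day = [
        Alert("cpu_high", "warning", acknowledged=False),
        Alert("disk_full", "critical", acknowledged=True),
        Alert("cpu_high", "warning", acknowledged=False),
        Alert("latency_p99", "warning", acknowledged=False),
    ]
    # 3 of 4 alerts ignored -> 0.75; if this stays high while the business
    # is fine, the alerts are noise and the thresholds need rework.
    print(f"unactioned alert ratio: {unactioned_ratio(day):.2f}")
```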
00:53:35
Speaker
Got it. Got it. So something like that. So that's a good metric. The second metric is, you know, they can figure out how many cross-functional conversations happen.
00:53:48
Speaker
Right. How frequently do they happen?
00:53:52
Speaker
Right. I don't know. I haven't really thought about it. Okay. Okay.
Metrics for Assessing Observability Maturity
00:53:56
Speaker
But the thing which you talked about early on was starting by looking at the low-hanging fruit, like the number of alerts being generated with nobody taking action.
00:54:05
Speaker
Then whether all the different teams, like developers, QA... Oh, simple thing: how many people have access? The more people who have access to the observability data, the better your culture is. Yes. Because, you know, given that it's passive data and entirely historical...
00:54:22
Speaker
Ideally, everybody should have access. There should not be any particular restriction, because there's no security requirement for restricting access. Not having access, correct. Correct. And moreover, finally, the last thing is how many people, maybe from different groups, whether it is developers or business or QA or SRE or whatever, are taking action to improve the overall application, however hard that is to measure.
00:54:46
Speaker
That may be pretty hard to measure, right. But what percentage of the company has access to observability data might be much easier to figure out.
00:54:57
Speaker
Because otherwise you'll have to find a good proxy. For instance, for how many people took action, you'll have to find a good proxy, like: how many change management requests specify observability data as the trigger?
00:55:11
Speaker
Something like that. Correct. Correct. Correct. In this whole observability space, right, there is another term which I just mentioned, SRE, which keeps coming up in most places, right?
00:55:24
Speaker
How does an SRE team use observability data? Because SRE again is a much larger concept; observability is one part of their work, right?
00:55:36
Speaker
But SRE, site reliability engineering, is basically about the design of systems that are reliable. You know, again, I've spent some time on this. Think of, let's say, a door latch.
00:55:48
Speaker
You know, our original kundi door latch, you know, the chain with the hook. It may not be particularly fancy, but it is incredibly reliable. It can last a thousand years without too much change, right? And it requires no maintenance.
00:56:02
Speaker
It is very hard to break down a door with that kind of latch. So it works very well. But of course it has other challenges: it will be heavy, the door sits a little loose, this and that, even though there have been all kinds of improvements to it.
00:56:17
Speaker
Even the sideways latch is pretty reliable; very hard to fool around with it. But, for instance, the friction latch has failure points: eventually the rubber wears away, so the latch becomes loose.
00:56:30
Speaker
And so that is what reliability is about: how, by design, the system is made more reliable. There are a few basics, though of course it's a very comprehensive field.
00:56:42
Speaker
But there are a few basic things. One is simplicity: the simpler the system and the fewer the moving parts, the more reliable it is. Second, the more observable it is, the more reliable it is. Yeah. Opaque systems, black boxes, are always unreliable.
00:56:57
Speaker
You can't even fix it. No, no, and it's very hard for anyone to improve it because, you know, how would you improve it? You don't have the data. Right, right. So there are a few things like that. So that's where observability plays a key role.
00:57:11
Speaker
You have to design a system based on knowing what happens in real life. Of course, you can accidentally design a very good system, like somebody designed the wheel without having all this observability data.
00:57:24
Speaker
But the fact is that doesn't happen much in life. You have to design it, then observe what happens with it. Design it a bit more. Observe what happens with it. Design it a bit more. And then over a period of time, you get some really nice, simple, reliable design.
00:57:38
Speaker
That iteration is very important, right? I mean, getting the feedback. That is where the word observability also came from, control system theory, right? Whatever your system's internals are, can you basically use certain output signals to figure them out? That's how the observability term also came up.
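For reference, the textbook control-theory statement of that idea (not something spelled out in the conversation itself): a linear system's internal state is recoverable from its outputs exactly when the observability matrix has full rank.

```latex
% Internal state x, measured output y; the pair (A, C) is observable
% when the observability matrix has full rank.
\[
\dot{x} = A x, \qquad y = C x,
\qquad
\mathcal{O} =
\begin{bmatrix}
C \\ CA \\ CA^{2} \\ \vdots \\ CA^{n-1}
\end{bmatrix},
\qquad
\operatorname{rank}(\mathcal{O}) = n
\;\iff\;
\text{the state } x \text{ can be reconstructed from } y .
\]
```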
00:57:56
Speaker
And then it's all about iteration. Correct. Without observability, it's impossible to iterate. Yeah. If you don't know what's happening, then you will theorize and the theory may be completely wrong.
00:58:09
Speaker
Right. Right. Very interesting, Shanky. Shanky, just one thing I wanted to hear from you: how do you see this whole culture and maturity side of observability being impacted by AI and Gen AI? One part you talked about was the automation and other things.
00:58:28
Speaker
But are there any other use cases you see Gen AI and AI actually bringing in here? So... The main thing is that, you know, a lot of this monitoring activity can easily be replaced by Gen AI.
00:58:44
Speaker
Of course, if you have bad alerts, you'll still get bad outputs. But the monitoring part, what the human does today, can easily be replaced. And with Gen AI, the main challenge with automation used to be that writing these scripts, testing them, and deploying them was quite tedious.
00:59:01
Speaker
AI can really accelerate that process. For instance, you know, nowadays you have these AI agents. Agentic AI. Correct. Agentic AI. Agentic AI is ideal in the ops world.
00:59:14
Speaker
It's very easy to write a very simple workflow, put the rest in English, and then the agent will interpret the English and do as necessary. Right. And similarly, you know, some alerts are not deterministic. So there, an agent can come to the point where it requires human input, get that input in English, and act on it.
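A minimal sketch of that pattern, with call_agent() as a hypothetical stand-in for whatever LLM or agent framework is in use: a small fixed workflow, the fuzzy policy written in English, and a human-input escape hatch for the non-deterministic cases.

```python
# Minimal sketch: a fixed workflow plus an English policy that an agent
# interprets. call_agent() is a placeholder, not a real library API.

POLICY_IN_ENGLISH = """
If the alert is a known transient (brief CPU spike, single failed health check),
acknowledge it and do nothing. If a service keeps failing its health check,
restart it once. Anything involving data loss or security: ask a human.
"""

def call_agent(prompt: str) -> str:
    """Placeholder: send the prompt to an agent and return its chosen action."""
    raise NotImplementedError("wire this to your agent framework of choice")

def handle_alert(alert_text: str) -> None:
    decision = call_agent(
        f"Policy:\n{POLICY_IN_ENGLISH}\nAlert:\n{alert_text}\n"
        "Reply with one of: IGNORE, RESTART, ASK_HUMAN"
    )
    if decision == "RESTART":
        print("restarting the affected service (placeholder)")
    elif decision == "ASK_HUMAN":
        # non-deterministic case: pause for human guidance in plain English
        guidance = input("Agent needs guidance, describe what to do: ")
        call_agent(f"Operator guidance: {guidance}\nNow act on alert:\n{alert_text}")
    else:
        print("alert acknowledged, no action needed")
```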
00:59:36
Speaker
Right. Right. Right. Shanky, before I let you go, I have one thing which we generally ask all of our guests: what are the books which really inspired you or continue to inspire you, which you always go back to, or the books which you generally give to people?
00:59:54
Speaker
Do you have any recommendations? Nowadays, I mean, books... of course, I read a lot of fiction. I don't read non-fiction quite as much. I read non-fiction on blogs, not in books.
01:00:06
Speaker
Blogs and newsletters. So there's Lenny's Newsletter, which I read a lot. And there's Stack Overflow, which I keep reading for various things. There are quite a few. Medium is another massive source of great information and good quality.
01:00:23
Speaker
Then, I mean, podcasts, of course; some audio podcasts are also very good sources of information for me. Right.
01:00:33
Speaker
And... I used to, I mean, nowadays I have done a little less of it in the last few years, but I actually go to conferences and look at the partner network.
01:00:45
Speaker
You know, of course, the main conference, let's say an Oracle conference or a Finacle conference. So Finacle and Oracle and all have roadmaps or, you know, big announcements. But their partner network often has much better innovation.
01:01:00
Speaker
And their innovation in turn tends to have blogs and all that. That's a large part of how I gain knowledge. Today, books are falling behind; by the time you write a book, it's a bit obsolete.
01:01:12
Speaker
The classics are still valuable, though, especially around team structures. In fact, there's one 1970s book, the Myth of the Man Month.
01:01:26
Speaker
The Mythical Man-Month. That's still around. It's a 1970s book. It is still... Everybody should read it. Even now, companies are making the same mistakes. Correct, correct. The other book which I really liked was High Output Management by Andy Grove.
01:01:40
Speaker
I mean, he talks more from the manufacturing world, but it's an amazing book. Then there was another interesting book... There's The Phoenix Project. It's told as a fictional tale of a CIO and a CISO and this and that, and it's quite fun to read. It's worth reading for the politics of the different pieces. Okay.
01:02:05
Speaker
Okay. Cool. I haven't read the Andy Grove book. I must read it. Yeah. High Output Management. Very nice book. Very, very nice book. It was an Intel thing which they were doing at the time. It has driven a lot of the concepts of management today.
01:02:20
Speaker
Right. Got it. Cool. Thank you so much, Shanky, for your time. I really, really appreciate this discussion and the many stories from your experience, which I am sure a lot of our viewers will find very interesting and insightful.
01:02:36
Speaker
Thank you so much for your time today.
01:02:39
Speaker
Thanks. Hope you found my discussion with Shanky insightful. If you did, please consider sharing it with your colleagues. For more information about VueNet Systems, please visit us at www.vuenetsystems.com.