
EPISODE 1
Observability is Done. What Comes Next.

Observability won the last decade by giving humans more data. The next decade is not about more dashboards. It is about systems that understand themselves.

This episode lays out what SRI is, why it had to exist, and what AIOps is getting wrong. Not a product pitch. Not a founder origin story. A worldview episode.


Transcript

Introduction and Guest Background

00:00:00
Priyank Upadhyay
Yes. Hey everyone. Welcome to the Root Cause. What a great way to start the show. And I am very happy to introduce my first guest.
00:00:15
Priyank Upadhyay
Today I have someone who I have known personally for a very long time. We met in college and have stayed close for almost a decade now, both of us trying to figure out what we want to do and build.
00:00:29
Priyank Upadhyay
He went the operator route, I went the founder route. And the reason I wanted him on the show is that he has seen the problem.
00:00:41
Priyank Upadhyay
The problem that we are going to talk about today, and from more angles than almost anybody I know.
00:00:51
Imran
Yeah.

Incident Management Challenges

00:00:52
Priyank Upadhyay
Imran started his career at TCS, took a swing at building his own company, then spent the next decade going deeper into infrastructure and reliability at GE Healthcare, Avesha, Razorpay, and Firebolt.
00:01:08
Priyank Upadhyay
His career spans from developer to senior developer to engineering manager, and he built teams underneath him at each stop. And outside his day job, he is a Docker Captain. He maintains KubeSlice, an open source project in the multi-cluster Kubernetes space.
00:01:26
Priyank Upadhyay
And he has spoken at KCD Pune. So if you have run anything serious in production in the last 10 years, there's a good chance you have run it on something that he has worked on.
00:01:39
Priyank Upadhyay
He is not here to talk about his companies. He is here as a friend and as someone who has lived the problems this show is about.
00:01:51
Priyank Upadhyay
So Imran, welcome to The Root Cause.
00:01:54
Imran
Thank you, Priyank. Thank you so much. I think it's a long-awaited podcast. For the last three years, people have been asking you the same question: when are you going to start it?
00:02:05
Imran
And here it is. And I'm honored to be your first guest. And yeah, happy to take it from here.
00:02:13
Priyank Upadhyay
Thanks, I know I was also waiting a lot for this.
00:02:16
Imran
Yeah, I know.
00:02:18
Priyank Upadhyay
Alright, so Imran, jumping straight to the topic that I wanted to cover: across the 10 years of you operating systems at GE Healthcare, Razorpay, Firebolt,
00:02:24
Imran
Mm-hmm. Mm-hmm.
00:02:38
Priyank Upadhyay
even your open source work and being a Docker Captain, right? How many incidents did you see get a full root cause analysis?
00:02:49
Priyank Upadhyay
Not closed, not mitigated, fully root caused with a real honest understanding of why it happened.
00:02:58
Imran
So, man, the first question you asked is very difficult. Okay, it's difficult to answer because, Priyank, I think this is something day-to-day in the industry.
00:03:14
Imran
It's a very common thing. When you develop something, when your software is on a platform, the first thing you do is try to handle these incidents. You create some kind of incident management team, whether your devs handle it or a specialized SRE team does. So what I'm trying to say is, it's very difficult to answer. I have seen anywhere from 60 issues a day on some days to 700 to 800 a month.
00:03:38
Imran
I have seen a lot. But to your question, you want to know how many are closed, full circle, with an RCA? Yes, that number is very low. The count of incidents with action items actually derived from them is low.
00:03:54
Imran
What generally happens with these incidents, I can tell you, is that we try to patch things up. So we do fix it. It's not like we don't, because there is a direct business impact if you don't fix it.
00:04:05
Imran
But the answer is, I don't know exactly, but the number is huge. And generally it goes untouched, unattended. Most of these issues are unattended because you focus on whatever is really causing pain to your business. But out of the 700 or 800 issues you are getting every month, you just, but it

Observability Tools and MTTR

00:04:24
Imran
someplace it is buried in the document.
00:04:26
Imran
I'm talking about not one or two companies; every company does that. So yeah, the portion that has not even been touched, apart from some firefighting or some kind of patching, is far larger than the portion that really gets closed. That's it.
00:04:44
Priyank Upadhyay
Absolutely. I agree with a lot of that, because in my experience I have seen something very similar.
00:04:54
Imran
Yeah.
00:04:56
Priyank Upadhyay
But a lot of tooling has been there for quite some time. What do you think observability has already solved, and what has it not solved yet?
00:05:11
Imran
So yes, I agree. Look at it this way: at the start of my career, I saw a legacy trading system that kept its logs as .txt files dumped onto storage on some remote server. If you wanted to go and search, just imagine how difficult it would be. You had to go through multiple text files, one per day or one per hour, and figure out what had happened. So there is tooling. And frankly, I have seen that tooling being built in the industry from scratch. I mean, people went from building their own to using third-party tools, which are doing a fabulous job, no doubt about that.
00:05:52
Imran
So yes, tooling is there, and what exactly it solved is this part. There are two parts to it. Earlier, generating logs and deciding how they would be stored was difficult, because you had to plan it, you had to build something around it, you had to run your own project to make sure the logs were going in the right way. This is solved now. There is a place where you can get everything, you have your own retention policies, and you can get your older data. That also gives us the flexibility to capture any small moment, not only a big event. You push everything
00:06:28
Imran
to this tooling system, whether it's an event, or something from the cloud, or some specific notification you want to store for a specific time. So it's easy now; this is solved. In a single word, I can say visibility is much, much better right now. Now you have something where you can see things that earlier were very difficult to manage. I have seen those days, and this is the difference I am seeing in the industry right now.
00:06:53
Priyank Upadhyay
Makes sense. In my opinion, the main objective of observability tools was to make sure MTTR is reduced.
00:07:11
Priyank Upadhyay
However, that has not happened in my opinion, because you do get visibility, but you are also getting more code in the system now.
00:07:21
Priyank Upadhyay
So it becomes much more challenging.
00:07:22
Imran
yeah
00:07:24
Priyank Upadhyay
But yeah, I think it's a very structural problem, and solving it is actually required.
00:07:33
Imran
So i

AI's Role in Incident Management

00:07:35
Imran
I want to focus on one specific topic right now. You said MTTR, right? MTTR is something that predates this cloud-era SaaS tooling, and I think it still exists in the industry.
00:07:52
Imran
This is a very important metric that is owned by engineers, but the impact is on the product. Put simply, if this is not improving in your company, that means you have more downtime. That means your customer experience is going, I don't know, into the gutter, and moreover, you're losing money.
00:08:13
Imran
So understand this. With this tooling, MTTR was helped. Now you know where to go and search, but the question remains the same. There is still a person who needs to understand where to search, what to search for, and what time span to look at, right? So the problem is not solved. The mode of doing it has changed from how people did it earlier, because of the new observability tools in the market.
00:08:41
Imran
And this MTTR, I would love to know more about it, because I have read your article about MTTR and I have written a post on LinkedIn too, because I discussed it with you last time, right? I think it was 15 or 20 days back that we were discussing it. We always look at resolution, because when it gets resolved is what really matters for the business or the product.
00:09:02
Imran
But the reality is, if we don't understand what this is, how is it going to be solved? How is it going to be resolved? It's very difficult for it to resolve itself. So MTTU is something in which I saw very good potential, and I would really like to know: do you see this really working, or is the industry still chasing the MTTR side right now? How do you see this?
00:09:23
Priyank Upadhyay
Yep. See, the concept of MTTU came in when I was building Rubik's Cube, at a very early stage actually.
00:09:34
Priyank Upadhyay
The objective, or the North Star, was how I can reduce MTTR, because that's the ultimate goal of any kind of tooling like this, right?
00:09:42
Imran
Yeah.
00:09:43
Priyank Upadhyay
From the onset of an issue till it is getting resolved, there are multiple phases of a life cycle of an incident that happens.
00:09:53
Imran
Yeah. Yeah.
00:09:54
Priyank Upadhyay
There is this detection part we often refer to as MTTD, mean time to detect, right? There is a lot of tooling that enables us to detect things faster and get us an alert so that, you know, we can start working on it.
00:10:09
Priyank Upadhyay
We acknowledge that incident and then we move towards solving it. All I did was break down this whole MTTR cycle and identify the MTTU phase, which is mean time to understand, because this is the longest phase, wherein you are investigating, you are asking people, you are first of all finding the people you should be asking questions to, and then going after them, chasing them, getting a response from them,
00:10:30
Imran
Mm-hmm. Mm-hmm. Mm-hmm.
00:10:40
Priyank Upadhyay
trying to debug things via observability tools and telemetry and metrics and other things. And then, once you understand, the solution part is very easy.
00:10:51
Priyank Upadhyay
There could be just a four-line command that you need to run. There could be a code change that would go live. It may require two days, five days. But the longest time, or the struggling phase, for an SRE who is responsible for reducing this MTTR is this phase of mean time to understand.
00:11:12
Priyank Upadhyay
And that, see, is the portion which I am trying to solve
00:11:13
Imran
Yeah.
00:11:18
Priyank Upadhyay
by giving Rubik's Cube that understanding of whatever a human being does. Basically it does the same thing in a more agentic, optimized, autonomous way. That's all.
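
The lifecycle breakdown described above (detect, acknowledge, understand, fix) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how Rubik's Cube or any other tool measures it: the `Incident` fields, the phase names, and the sample timestamps are all assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical lifecycle timestamps; real tools may name these differently.
    started: datetime       # onset of the issue
    detected: datetime      # alert fired
    acknowledged: datetime  # someone picked it up
    understood: datetime    # root cause identified
    resolved: datetime      # fix applied


def phase_minutes(inc: Incident) -> dict[str, float]:
    """Split one incident's total time to resolve into its phases, in minutes."""
    def minutes(start: datetime, end: datetime) -> float:
        return (end - start).total_seconds() / 60

    return {
        "detect": minutes(inc.started, inc.detected),
        "acknowledge": minutes(inc.detected, inc.acknowledged),
        "understand": minutes(inc.acknowledged, inc.understood),
        "fix": minutes(inc.understood, inc.resolved),
    }


def mean_times(incidents: list[Incident]) -> dict[str, float]:
    """Average each phase across incidents; MTTR is the sum of the phase means."""
    per_incident = [phase_minutes(inc) for inc in incidents]
    result = {
        "MTTD": mean(p["detect"] for p in per_incident),
        "acknowledge": mean(p["acknowledge"] for p in per_incident),
        "MTTU": mean(p["understand"] for p in per_incident),
        "fix": mean(p["fix"] for p in per_incident),
    }
    result["MTTR"] = sum(result.values())
    return result


# Made-up sample data: two incidents starting at the same moment.
t0 = datetime(2025, 1, 1, 12, 0)
sample = [
    Incident(t0, t0 + timedelta(minutes=2), t0 + timedelta(minutes=5),
             t0 + timedelta(minutes=45), t0 + timedelta(minutes=50)),
    Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=5),
             t0 + timedelta(minutes=25), t0 + timedelta(minutes=30)),
]
print(mean_times(sample))
```

In this invented sample the "understand" phase dominates the total, which is the pattern the conversation describes; real incident data may of course look different.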
00:11:32
Imran
No, see, I always admire new tech. That's why, as I think you also remember, I have been part of open source, because we understood this is something that is going to change the industry, right? Not always proprietary software, but open source that anyone can use.
00:11:49
Imran
When AI came, I was absolutely sure that the things we do on a day-to-day basis, which are very manual tasks, which are boilerplate, don't need to be done by hand. I'm 100% on that. I'm all hands on deck. Like,

AI Adoption and Trust Issues

00:12:03
Imran
yeah, I can solve it.
00:12:04
Imran
If you look at day-to-day code writing, some companies are claiming 70%, some 80%, some 90% of their production code is generated by AI. So yes, I also believe that to a certain extent we will require AI in our production.
00:12:21
Imran
Though, coming from the engineering side, I'm a little skeptical also. Probably it can happen. I'm hearing a lot of bad news in the industry, right? Someone getting a database deleted, some table getting deleted, and all that.
00:12:36
Imran
So I know this is going to come. The only thing is, obviously I have discussed this with you many times, and I see the way you're thinking about how to solve this problem with Rubik's Cube. But I still want to see how, as a whole, this AI SRE concept in the industry comes up with this type of solution, where you have AI working in production but I can still sleep, man. Then over the night I don't have to worry that something can go wrong and destroy my whole infrastructure in a second. How? I'm still confused, and maybe I want to see more, but I want to know from your perspective also: how do you see this?
00:13:16
Priyank Upadhyay
Yeah. See, adoption in any space, even in the development space, was always a challenge.
00:13:25
Imran
Mm-hmm.
00:13:26
Priyank Upadhyay
Anything that is a black box that you do not know how it functions, you'll never be able to fully you know rely on it. So that always has been a challenge.
00:13:36
Imran
Agreed, yeah. Mm-hmm.
00:13:38
Priyank Upadhyay
However, the
00:13:41
Priyank Upadhyay
value and the impact that we are observing in the industry, I'm very bullish on that. I am seeing a single person starting a company and running a 10-man army just by, you know, creating agents in Claude Code.
00:14:00
Priyank Upadhyay
So if, let's say, even 80% or 90% of things can be automated, human beings are there to handle the remaining 10% or so.
00:14:12
Priyank Upadhyay
Using it efficiently is something that we should move towards. But as you mentioned, putting it inside production is going to be...
00:14:25
Priyank Upadhyay
a challenge always.
00:14:27
Imran
yeah
00:14:27
Priyank Upadhyay
SREs and DevOps folks are always skeptical about putting in anything untried and untested. So that skepticism is there. It will be there for sure, and it is good that we have it, so that we can build better products.
00:14:44
Priyank Upadhyay
Now ensuring a good harness of guardrails, security practices and
00:14:53
Priyank Upadhyay
Basically the first approach that I take is zero trust, wherein you do not allow it to do anything.
00:14:54
Imran
yeah okay
00:15:00
Priyank Upadhyay
You first gain trust, and only then ask it to do things that you otherwise would not let it do. Basically, start any tooling in an observe-only mode.
00:15:11
Imran
okay yeah
00:15:12
Priyank Upadhyay
If something can assist you, it becomes much easier to rely on it and trust it. And even if it only helps you cut a lot of time, a lot of latency, out of your workflow, it is a useful tool, right? The impact and trust are gained once you start to see the value it is going to bring in.
00:15:36
Priyank Upadhyay
If certain decisions are taken and there is human oversight on them, and the human agrees that, okay, the decision or whatever action items are to be taken up are correct, then after the approval the agents can do that in production as well. Because ultimately it's not about, you know, making it fully autonomous, but ensuring that you have that trust. And basically, you should not waste your

AI's Impact on SRE Roles

00:16:06
Priyank Upadhyay
weekends on something which can be automated. That's my ultimate goal.
00:16:11
Priyank Upadhyay
It takes a while to gain the trust, but we are moving towards it.
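
The progression described above, starting in observe-only mode and letting the agent act only after a human approves, can be sketched as a small gate in front of any remediation step. This is a minimal sketch under stated assumptions, not Rubik's Cube's actual design: `Mode`, `ProposedAction`, `Gate`, and the approver name are all hypothetical names invented for the illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional


class Mode(Enum):
    OBSERVE_ONLY = "observe_only"        # agent may only suggest, never act
    APPROVAL_REQUIRED = "approval"       # agent acts only after a human says yes


@dataclass
class ProposedAction:
    description: str
    run: Callable[[], str]               # the remediation step itself (hypothetical)


@dataclass
class Gate:
    mode: Mode = Mode.OBSERVE_ONLY       # zero trust by default
    audit_log: list[str] = field(default_factory=list)

    def handle(self, action: ProposedAction,
               approved_by: Optional[str] = None) -> Optional[str]:
        """Record every proposal; execute only in approval mode with a named approver."""
        self.audit_log.append(f"proposed: {action.description}")
        if self.mode is Mode.OBSERVE_ONLY:
            return None                  # suggestion is logged, nothing is executed
        if approved_by is None:
            self.audit_log.append("blocked: awaiting human approval")
            return None
        self.audit_log.append(f"approved by {approved_by}")
        result = action.run()
        self.audit_log.append(f"executed: {result}")
        return result


# Usage with a stand-in action that only returns a string.
gate = Gate()
restart = ProposedAction("restart the checkout pod", lambda: "restarted")
gate.handle(restart, approved_by="oncall-engineer")   # observe-only: not executed
gate.mode = Mode.APPROVAL_REQUIRED
gate.handle(restart)                                  # no approver: not executed
gate.handle(restart, approved_by="oncall-engineer")   # approved: executed
print("\n".join(gate.audit_log))
```

Defaulting to `OBSERVE_ONLY` means a misconfigured gate fails closed, and the audit log gives the human reviewer a record to build trust from before switching modes.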
00:16:11
Imran
ands
00:16:16
Imran
Agreed. I'm not denying it. See, as you understand, where the industry is right now, I have seen, I don't know, hundreds of tools by this time, with people trying to solve things in observability, either through intelligence or through faster searching. There are also tools right now trying to figure out an incident from the logs, and trying to figure it out autonomously.
00:16:47
Imran
The thing I still see with an AI SRE is that it's very big, right? The responsibility of an SRE, or of the people who understand this type of issue day to day and fix it, is big.
00:17:00
Imran
Now, with this responsibility and the whole boom of putting AI into production and solving this, I still see that it is not going to replace humans, okay? But as I mentioned, I know AI is solving a lot of problems. So I can see that AI will solve problems, to some extent. I'm not sure how much, but to some extent an SRE will be in relief mode, or a guy who is on call can be more relaxed, because there is some kind of AI working
00:17:37
Imran
with him. So that's how I see the industry looking at it right now. But I would also like to understand: is this what you are working on as well, or do you want to go beyond this?
00:17:52
Priyank Upadhyay
So your read of the current situation is absolutely right. One trend I can give you an insight on is that, while talking to customers, what I am seeing is that it's not just me who is doing something new here.
00:18:05
Imran
Yep.
00:18:09
Priyank Upadhyay
Everybody is adopting AI in some form or the other, and people are already trying to solve it internally with in-house products. And this is what the job of an SRE is, right? To automate whatever task is being done.
00:18:24
Priyank Upadhyay
If an incident comes once, it should not be repeated. That's the ultimate objective of an SRE. Everything should be running as usual. Now, to get there, everybody is trying out, you know, various kinds of approaches.
00:18:39
Priyank Upadhyay
It could be a RAG-based system, wherein you identify runbooks and you trigger those at the correct operation.
00:18:44
Imran
hmm yep
00:18:47
Priyank Upadhyay
Identifying false positives in alert noise, right? A lot of alerts come in and you have to identify which ones need your attention.
00:18:50
Imran
hmm hmm hmm
00:18:54
Priyank Upadhyay
So these kinds of individual use cases are already being done by people. So adoption-wise, there already is a wave.
00:19:00
Imran
okay
00:19:04
Imran
okay
00:19:04
Priyank Upadhyay
What I'm thinking, and I could be totally wrong about it, is that the role of an SRE, at least the incident management aspect of it, was a workaround until we had a proper approach or technology to eliminate that.
00:19:31
Imran
Okay.
00:19:32
Priyank Upadhyay
But we do have that now. I'll tell you one thing, and this is very interesting. Rubik's Cube is running on Rubik's Cube, and if you check our status page you do not see any tickets, because by the time a ticket is created it is automatically resolved.
00:19:38
Imran
Okay.
00:19:43
Imran
Mm.
00:19:51
Priyank Upadhyay
I trust that. I do have some contingencies in place but it is automatically resolving it.
00:19:56
Imran
Yeah.
00:19:57
Priyank Upadhyay
From our customers, if I tell you, the lowest numbers go down to 15 seconds, 45 seconds, and the average across all the customers that we have is 4.13 minutes to understand an issue and get to an action item, get to a root cause of why something is failing.
00:20:24
Priyank Upadhyay
And it happens not just because of the AI, but because of how the AI is tuned to understand the system better and give a better prediction.
00:20:36
Priyank Upadhyay
So the more you, right now, you, yeah.
00:20:38
Imran
But Priyank, I agree, but think of it this way. In my experience I have seen systems generating bugs daily, in the hundreds.
00:20:50
Imran
So imagine a place where there is a codebase, or a very large system, and collectively, at least one thing everybody will agree on 100% is that bugs are something you cannot avoid.
00:21:05
Imran
You cannot say there is no bug in a system. It's inevitable. It will be there; there will be corner cases in any case. And I have systems collectively generating 100-plus bugs every day. So what you're trying to say is that if this AI, if AI SREs or tools like the one you are building, are tuned well, these bugs will go away? That is what I'm trying to understand, because this requires a lot of fixes, and the bugs will not go away. Maybe they will reduce. But are you saying you are targeting no bugs in the system at all, or are you trying to reduce them?
00:21:50
Imran
How do you see it? Because I know it is not going to get less. Every system has some kind of bug, and they are flooding in every day.
00:21:57
Priyank Upadhyay
It's going to increase, Imran.
00:21:59
Imran
Okay.
00:22:00
Priyank Upadhyay
I'm saying the count of bugs will increase. You're hearing this news about AI tools deleting databases, right?
00:22:09
Imran
Oh, yeah, I told you guys.
00:22:10
Priyank Upadhyay
These are not bugs. These are massacres that AI is doing.

Future of AI in Incident Management

00:22:14
Priyank Upadhyay
Now, yeah.
00:22:14
Imran
And this is not fun news, man, because when I hear something like this, it gives me more resistance. After AI writes the code, I ask my engineers ten times to at least do a proper review, because I don't want something like a SQL injection in the year 2026, and just
00:22:33
Priyank Upadhyay
The problem is this, Imran: it's not about faster resolution, or that bugs won't come and everything should be, you know, foolproof. Rather, the volume of bugs or incidents, or even the workload that is going to come in for the roles that I'm targeting, is going to increase substantially.
00:22:34
Imran
yeah
00:22:45
Imran
Mm-hmm. Mm-hmm.
00:23:03
Priyank Upadhyay
And human cognition is not built in a way to support that. You cannot, so it's like this, you cannot just keep increasing the headcount and hope that all the bugs will be resolved.
00:23:15
Imran
Okay.
00:23:15
Priyank Upadhyay
Because each one requires the same debugging process, the whole MTTR process of solving an issue. But if you increase that count substantially and you just keep adding engineers to it, it is not going to cut it.
00:23:32
Priyank Upadhyay
What I'm targeting is not to reduce the amount of work. That is something human beings should focus on. Write better code, do better reviews, push quality code to production so that it doesn't delete the database.
00:23:46
Imran
Yeah. Yeah.
00:23:48
Priyank Upadhyay
If something happens in the post-production system, for at least 60, 70, 80% of it, my target would be to reach a point where very little human involvement is required for any kind of redundant task.
00:24:05
Priyank Upadhyay
If a bug is there, it should be understood and it should be solved if it is solvable. If it is not solvable, then obviously it should be properly analyzed and reported back.
00:24:15
Imran
Yeah. See, I'll give you one insight. I think you already know this. When we were working at Avesha, we faced this thing at that time also, right?
00:24:26
Imran
Any developer who is on call, if you go and ask, how's your job? How are you feeling on a whole week when you have done the on call?
00:24:37
Imran
The answer will be very clear, because I have been hearing this for, I don't know, four and a half years from every developer who is on call. They don't want to do it. Every company has a development team that wants to build product but is stuck on call for a week or two, trying to do something which nobody wants to do, not a single guy. This is the worst part, I think, the nightmare I have seen, because as soon as I create a roster of on-call people to go,
00:25:10
Imran
I never see anyone happy. And I myself am not happy to send them, because I know their potential. I know that in a week they can build something or work on something which is way more efficient, way better, than just looking at a dashboard.
00:25:28
Imran
I can tell you this. That's why, I think I told you last time when we were discussing AI, if something like this comes to the industry and somehow gives this time back,
00:25:41
Imran
this will be great for any company, because everybody knows that in a month of four weeks, a guy spends one week on call, or at least in two months they are doing one week of on-call.
00:25:53
Imran
And this is the biggest loss of time. From the management side, from the EM side, I always feel, oh my God, if there were a way I could get this time back, that guy with so much potential could work on something else,
00:26:05
Imran
and that could have a huge impact on everything. But no, there is no solution to it. We need someone there, and that's the worst part: they have to watch 24/7. It's not like, okay, ten to five or nine to five I'm just looking at a dashboard and at night it will go away, because the system is running 24/7.
00:26:22
Imran
So this is a pain I can tell you about directly, and it's not only me. I have seen it because I was a part of this, but I have also seen it from the other side, where I have to put someone on call and I really see an unhappy face.
00:26:37
Imran
So yes, I also want something like this to come into the industry, so that it will solve a big problem which everyone has. But how is the question, and I'm really looking for answers right now from companies like Rubik's Cube: how is this going to give this time back to the people who really want to do something else, at least some productive work?
00:27:05
Priyank Upadhyay
I'm seeing a good amount of the effect that we are making, actually. And you are absolutely right about the pain, and that is my personal grudge with the problem statement, because I do not like being on call.
00:27:12
Imran
Yeah. Mm-hmm.
00:27:24
Priyank Upadhyay
I want to be free, be creative, do something new. And obviously there are times when I want to actually go and solve bugs, but that should be my choice.
00:27:35
Priyank Upadhyay
So that's an opinion, like a developer side opinion of things. If I'm just an SRE, then obviously that's my role, that's my duties, and I have to fulfill that. But nobody wants to stay on call.
00:27:50
Imran
Yeah.
00:27:50
Priyank Upadhyay
Something creative needs to come from human minds. That's my approach.
00:27:53
Imran
Yeah. Yeah, yeah.
00:27:56
Priyank Upadhyay
And we are seeing very good responses towards it.
00:28:01
Priyank Upadhyay
from people who have actually been using Rubik's Cube, right? I am seeing a big shift in the ratio of their time. An SRE's time is typically distributed between solving incidents, managing incidents, and automating the infrastructure, systems, or tooling, or making it better, right?
00:28:23
Imran
Mm-hmm.
00:28:23
Priyank Upadhyay
So that shift is being...
00:28:23
Imran
Mm-hmm.
00:28:26
Priyank Upadhyay
very prominent in our customers. People are moving away from incident management to something which they have been parking for a very long time.
00:28:36
Priyank Upadhyay
They wanted to do an audit for SOC 2 compliance because that is coming up, and because of the, you know, older incidents that were piling up, they were losing focus on it.
00:28:48
Priyank Upadhyay
There are situations where cost optimization is something people have wanted to do for a very long time but are unable to focus on, because, you know, something or the other keeps breaking every day.
00:29:00
Priyank Upadhyay
So this shift is there for sure. I'm hearing people, SREs, junior SREs actually coming up and you know talking to me saying that you helped me get my weekends back.
00:29:16
Priyank Upadhyay
I'm going out with my girlfriend now. So, this is a weird case, but yeah, I think one of the biggest challenges, you lose a lot of personal life and personal space because of these things.
00:29:28
Priyank Upadhyay
So, those things. And even at such an early stage, we are seeing a good enough impact.
00:29:28
Imran
Oh, I know, yeah.
00:29:35
Priyank Upadhyay
So, yeah, I think it's going well.
00:29:39
Imran
See, don't get me wrong, but I have tried this. I was looking at some of the tools in this space, the AI SRE space specifically, and the term "AI SRE" itself is very gimmicky. What I understood is that it was mostly collecting my own data, which I already have, and dumping it into, I don't know, some kind of LLM in the background, maybe one of the famous ones. And the answers are only as helpful as what I can do myself. Yeah, maybe with a little bit of effort.
00:30:12
Imran
But to save that much effort, I don't want to feed my own very crucial information to an LLM, especially

Security and Efficiency in AI Adoption

00:30:21
Imran
since, as you already mentioned, a few of the companies I've worked at are very protective. Not only those; a lot of companies are very protective about data. Every company should be very protective about its data, especially customer data or healthcare data.
00:30:33
Imran
So I don't know. It again seems like a lot of it is a wrapper kind of thing in this space, which for the last couple of years people have been doing themselves anyway. Let's assume you don't know what command to run to scale a pod.
00:30:48
Imran
You're just going to say, hey, this is the error I'm getting, paste it in the chat, and ask what to do. And I think ChatGPT or any LLM will tell you, okay, run this command, it will help you. My concern is, how will this be helpful if it is only this much? I mean, this is okay, but it is not something significant that is going to replace anything, or the process itself. So the time-back part we are discussing, is it real with only this much? Is this the way to solve it? I'm not sure. I think my engineers can do this on their own, right?
00:31:29
Imran
So maybe I'm wrong. That's what I'm saying. I'm seeing a lot of companies doing this right now. So how do you see this?
00:31:35
Priyank Upadhyay
You are absolutely right. I agree. And security, data residency, these are some of the major discussions and major points we get even on the first call.
00:31:49
Imran
Hmm.
00:31:49
Priyank Upadhyay
And it's a very valid concern. The only thing I would say in my defense is that Rubik's Cube is very flexible and controllable in ways which you typically wouldn't see when you're using something like Cursor or Claude Code or ChatGPT, because they are not meant to do these things. Rubik's Cube is specifically designed for such a situation.
00:32:19
Priyank Upadhyay
So having said that, one of the ways you can avoid pushing everything to an LLM, and what a lot of companies in a regulated setup do, is use a local LLM, or an organizational layer of LLM, maybe in AWS Bedrock or some other place, or self-hosted ones.
00:32:31
Imran
Yeah.
00:32:41
Imran
Yes.
00:32:44
Priyank Upadhyay
Where are we storing the data? Is it SaaS? Is it a self-hosted setup, bring your own cloud?
00:32:54
Priyank Upadhyay
These kinds of flexibilities we offer right from day one. We have thought it through. We have totally understood it. And I remember my discussions with you and other security folks who are typically in this space, because when I started building it, my first question was, what are the rejections I'm going to get from the infosec side?
00:33:18
Imran
Yeah.
00:33:21
Priyank Upadhyay
Because I wanted to get past that, and that's how Rubik's Cube is built fundamentally. You have every fine-grained control you want, along with customizability and security best practices.
00:33:36
Priyank Upadhyay
In fact, you can ask Rubik's Cube to identify what is flawed in the system. So it's the other way around. But yeah, you're right that in a regulated environment it's going to be a bit challenging, but people are adopting. You are seeing a lot of new things emerging from these companies as well.
00:33:59
Imran
Oh, yeah. Yeah. See, I have seen big names also. The cloud giants are also in this space. Obviously, I was following that news too.
00:34:10
Imran
The only thing is, I don't know. I was doing some research, but obviously you are the expert here. Earlier we were talking about AI coding, and because of that people are firing developers right now.
00:34:26
Imran
So you're working in this AI SRE space. Are you saying we don't need SREs? My take on this one is...
00:34:37
Priyank Upadhyay
That's very controversial, Imran. Yeah.
00:34:40
Imran
So, no, my couple of points on this, Priyank: I don't think so, because those guys don't only do this. Yeah.
00:34:49
Priyank Upadhyay
Well, let me frame it better. I have been talking to you, and you have seen companies like TCS, Razorpay, Firebolt, Avesha, from big to small, all kinds of things.
00:35:09
Priyank Upadhyay
Do you think such a tool is going to replace SREs?
00:35:16
Imran
Okay, let's understand what exactly the SRE role does. It has two parts, right?
00:35:28
Imran
One is where they have to make the system reliable and stable, which obviously comes with the incident tooling and all those things. But reliability doesn't come from fixing issues. The second part, which is the biggest part of this role, is to make sure issues don't happen in the first place, or that the automation and tooling are built in such a way that we catch them very early, and that the obvious ones don't happen: issues due to scale, issues due to security, issues due to any number of things.
00:36:02
Imran
This is the job. Now, you are saying an AI SRE is going to do the job of debugging, understanding the infrastructure and everything. Frankly, if you ask me, I think it is not going to replace them, because it takes a human, with the cognition and the ability to understand, to work out from day one how the whole system needs to be built.
00:36:27
Imran
So first, infrastructure needs to be set up by someone, right? Someone needs to do these automations. Maybe if the split was 70% automation and 30% incidents, it will shift to 90/10, and if your claim is right maybe it can go to 95/5. So I don't see it replacing them, but the part where SREs have to do this toil, that can go away. But again, yeah.
00:36:54
Priyank Upadhyay
Absolutely. That's the aim we are targeting. We want to be the SRE's enabler, not a replacement.
00:37:05
Priyank Upadhyay
Because... I know.
00:37:05
Imran
A lot of companies, Priyank, are saying you don't need SREs. This is the claim. I'm asking you because you have been working in this for, what, months now.
00:37:08
Priyank Upadhyay
i know
00:37:15
Priyank Upadhyay
I think it's a very wrong narrative. Whichever of my incumbents are listening, please don't do that.
00:37:28
Priyank Upadhyay
Even if you somehow built a tool that can resolve 100% of incidents, that's not the only job of an SRE.
00:37:42
Priyank Upadhyay
So don't ever say it's going to replace them. I'm not claiming that it is going to replace anything. I only challenge the task of incident management, which should not be handed to humans.
00:38:04
Priyank Upadhyay
They are good at it, but they should not be doing it. That's my grudge. That's all.
00:38:10
Imran
No, that makes sense. I think this whole concept, maybe we need to rename AI SRE.
00:38:20
Priyank Upadhyay
I totally agree. Gartner came up with this in January this year, the term AI SRE.
00:38:29
Imran
I was going to say, it's from Gartner.
00:38:29
Priyank Upadhyay
and
00:38:30
Imran
I didn't know that.
00:38:30
Priyank Upadhyay
yes
00:38:31
Imran
I read some article, and then I got to know, OK, this is something called AI SRE.
00:38:35
Priyank Upadhyay
I tried naming it earlier, Imran. I tried to name this category Site Reliability Intelligence. I still feel it is a much cleaner name, because today it's AI, and if something else comes and tries to solve the same problem,
00:38:44
Imran
Oh, OK. Yeah. yeah
00:38:55
Priyank Upadhyay
it's going to be outdated. And I think the nomenclature also impacts how people perceive it. AI SRE sounds like just an AI tool that is going to replace, or fix the problems of, an SRE.
00:39:08
Imran
Replace. Yeah.
00:39:13
Priyank Upadhyay
But what we needed is a tool that actually does what an SRE does in an efficient way and takes a lot of the burden off them, so that they can focus on what is more important. And it is always an assisted approach.
00:39:34
Priyank Upadhyay
So even if you want to, let's say, set up some kind of automation, it can do that. And there are a lot of meta approaches that can be taken.
00:39:44
Imran
There is one thing which I have done in the past, and I think people in the industry know this, so I'm stating the obvious: if an SRE doesn't have an SRE job, they are the best DevOps engineer, and they are the best software engineer too. They can code, and they can also do the infrastructure work. So I don't know why... I have read, I think, 10-plus articles on this already, because before this call I knew there were going to be a lot of questions.
00:40:14
Imran
So why are people fixated on removing this role? This role is something which can be converted to something else easily. So yeah.
00:40:25
Priyank Upadhyay
It doesn't even need to be converted, Imran. The point is it already had those responsibilities.
00:40:28
Imran
Yeah.
00:40:32
Imran
like Exactly.
00:40:32
Priyank Upadhyay
When Google coined the term SRE, the intent was... I'll tell you how the Google story goes. I think it was in 2003 when this term was introduced.
00:40:43
Imran
Mm-hmm.
00:40:46
Imran
Mm-hmm. Yeah, yeah
00:40:48
Priyank Upadhyay
They said, we have sysadmins who are doing a lot of server-side work. Then we have engineers who are doing a lot of development.
00:40:53
Imran
yeah.
00:40:57
Priyank Upadhyay
How about we combine both of them?
00:41:00
Imran
Mm. Mm.
00:41:00
Priyank Upadhyay
They know how to write code. They know how to automate anything. You throw a problem at them, they will come up with a program to solve it repeatedly. What if we put them on sysadmin tasks?
00:41:14
Priyank Upadhyay
This is where it came from. This gave us a lot of beautiful things. IaC, infrastructure as code, Terraform and everything else.
00:41:21
Imran
Yeah, yeah.
00:41:22
Priyank Upadhyay
Cloud, Kubernetes, Docker, right? Everything that came in became more programmable, became more engineering-worthy.
00:41:28
Imran
Yes.
00:41:37
Priyank Upadhyay
Now you do not have to think when you want to spin up a container, right? You create that VM infrastructure, you run an application, and you do not have to think about the inner workings: when to set up the VM, what the memory should be, and everything else.
00:41:40
Imran
Yes, yeah.
00:41:52
Priyank Upadhyay
Right? These kinds of automations are what the SRE role was intended for, not being on call and fixing issues.
00:42:04
Priyank Upadhyay
That already has a name, right?
00:42:09
Imran
Agreed.
00:42:09
Priyank Upadhyay
On call, that's the name, right? My only objective, and I think my incumbents are also moving towards it, is that SREs should be assisted so they can get back to the responsibilities they were actually asked to take on, rather than doing unnecessary tasks.
00:42:28
Imran
Yeah.
00:42:32
Imran
I think that makes sense. And that's what I'm saying, right? I have interchanged this role very frequently on my own over the last couple of years. When I see that there isn't any kind of requirement on the system side or the first side,
00:42:53
Imran
they easily move into a role where they're just writing code and automations. Maybe platform engineer is a better term: they can write something which is used across teams, a horizontal notification system, say. So the point of the whole discussion is that I don't see why people are so crazily going after these people. They are among the SMEs in the industry. So yeah, I got your point: you are not chasing that. But it seems like this whole AI SRE kind of thing is chasing exactly that. It seems like they want to replace the people and go into the system. But I love the name SRI. I think I read it on the first version of your website also: Site Reliability Intelligence. The name is quite good; at least the full form is quite good.
00:43:41
Imran
But how do you see this evolving? What exactly does the term "intelligence" mean in the whole thing?
00:43:54
Priyank Upadhyay
I'll come to that. But before that, I would like to know from you, as an operator, what do you see?
00:44:00
Imran
Yeah.
00:44:05
Priyank Upadhyay
How do SREs solve stuff, and why are they able to solve it?
00:44:12
Imran
Okay.
00:44:13
Imran
How they solve stuff. I think part of the answer we have already discussed. See, there is a developer who is writing code. They understand the business requirement and try to convert it into code. And they are very good at generating that piece, which is business-correct or functionally correct.
00:44:37
Imran
Now from here a different journey starts. This piece of code is not going to work on its own. It requires proper resource allocation, a proper idea of how much resource it will take, how much scale it needs, how it will scale, and from there what tooling we require inside it. Is there any event we are emitting? Is there any log we require? If this piece of code fails, is there any impact or not? This piece is largely unknown to the people writing the business logic. So how an SRE works is that they understand, from the code, what it does. Maybe they don't know the business logic, but they understand what it can do. And the part where this code is containerized, or in whatever form it runs on the infrastructure, that part they will know.
00:45:36
Imran
So why do we put people like this onto the firefighting? Because when something happens, they have a very good understanding of how to trace back from top to bottom. What that means is,
00:45:48
Imran
the system is big. We have been fortunate enough to see the systems of a lot of banks and a lot of big companies when we were in Asia, and how complex their systems are.
00:46:01
Imran
So you cannot go and start searching everywhere. You need to pinpoint. These guys develop tooling around the system so that they know where to pinpoint. They start tracing back from top to bottom: where the issue happened, who emitted it, which service was responsible, which services were emitting, and what exactly triggered it, whether new code, recent code or old code. This is the job an SRE does, and they conclude with a full root cause analysis and what to fix. A lot of the time the fix is in the infrastructure itself. Maybe it's just throttling, like a CPU shortage or a memory shortage happening right now. Anything in the infrastructure, they solve quickly because they have the knowledge.
00:46:42
Imran
Then it comes to the code also. Most of the time I have seen that these guys even have the capacity to fix a couple of lines, like a switch and its cases if you have written them wrong, and fix it right away. So these guys have a very good understanding, they are good at figuring out the RCA, and they are good at debugging. This is how I have seen SREs work in action. This is how they really work.
00:47:03
Priyank Upadhyay
Right. And I'm just segueing from what you said.
00:47:09
Imran
yeah
00:47:09
Priyank Upadhyay
As I was also mentioning from the start, what we have built in Rubik's Cube actually mimics what an SRE does.
00:47:20
Imran
Mm-hmm.
00:47:20
Priyank Upadhyay
And you're absolutely right. The whole journey you explained is exactly what the Rubik's Cube system also tries to mimic. So the first thing you said is knowing the system, right?
00:47:29
Imran
Okay. Mm-hmm.
00:47:33
Priyank Upadhyay
Getting a view of the system and everything connected to it.
00:47:33
Imran
Yeah.
00:47:39
Priyank Upadhyay
And that is what we start with, you know, understanding your infra, your topology, your connected tools.
00:47:43
Imran
Mm-hmm.
00:47:46
Priyank Upadhyay
It could be your CI/CD pipeline. SREs also understand people: the roles and SPOCs assigned to a particular module or something.
00:47:53
Imran
Yes.
00:47:58
Priyank Upadhyay
So they know what they are doing.
00:47:58
Imran
Yeah. Yeah.
00:48:00
Priyank Upadhyay
They know how to communicate with them.
00:48:03
Imran
Hmm.
00:48:04
Priyank Upadhyay
Rubik's Cube also tries to do that. It connects to your Slack, your communication spaces and other things. So it knows who is working on what.
00:48:10
Imran
Hmm.
00:48:12
Priyank Upadhyay
The last aspect is the runtime context of what is happening, and the historical context of what was done to solve something.
00:48:23
Imran
Okay. Mm-hmm.
00:48:24
Priyank Upadhyay
Combine all this, right? What you get is context awareness. The same thing that an SRE has, same thing we are giving to an AI system, an agentic system.
00:48:36
Priyank Upadhyay
Now the benefit is that whatever the SRE would have done, the AI agents can also do, given a proper harness and this context that you have.
00:48:47
Priyank Upadhyay
Now, say some latency is starting to show up, right? You are seeing a longer API call. And you have visibility of, let's say, a deployment that happened an hour back.
00:49:02
Priyank Upadhyay
right
00:49:02
Imran
Mm-hmm.
00:49:03
Priyank Upadhyay
You can run DB queries and check what the exact latency is. You can go and check Grafana and see what the network traffic metric looks like.
00:49:16
Priyank Upadhyay
How many people are seeing that 500 issue, or what is the latency gap between your last running state and now? Having all this context, and also knowing who pushed the code and what the requirement behind the push was from Slack discussions, it becomes much easier to pinpoint and identify, okay, this is a new deployment that could be causing it.
00:49:32
Imran
Okay. Okay.
00:49:39
Priyank Upadhyay
Now, if you give this context to an AI SRE, or rather an intelligent system, an agentic system, it can go and run tools, right? It can go and investigate further, it can go and identify these things, and then give you a proper analysis of what happened and why it happened.
00:49:49
Imran
okay
00:49:58
Priyank Upadhyay
Once it does that, you approve it and it takes an action to fix it. It could be a rollback, it could be increasing some resources, or it could be freeing something up, say a Kafka worker that has been sitting idle. There are a lot of ways it could be solved, and that's what Rubik's Cube is capable of doing.
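The correlation step described above, a latency regression checked against a recent deployment, can be sketched roughly as follows. This is only an illustrative sketch, not how Rubik's Cube is implemented: the `Deployment` record, the 1.5x regression factor, the two-hour lookback and the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:
    # Hypothetical record of a deployment event pulled from CI/CD.
    service: str
    author: str
    deployed_at: datetime


def latency_regressed(baseline_ms, current_ms, factor=1.5):
    """True when the current median latency exceeds the baseline median by `factor`."""
    return median(current_ms) > factor * median(baseline_ms)


def suspect_deployments(deployments, incident_start, lookback=timedelta(hours=2)):
    """Deployments that landed within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    recent = [d for d in deployments if window_start <= d.deployed_at <= incident_start]
    return sorted(recent, key=lambda d: d.deployed_at, reverse=True)


# Made-up sample data for illustration.
incident_start = datetime(2025, 1, 1, 12, 0)
deployments = [
    Deployment("checkout-api", "alice", datetime(2025, 1, 1, 11, 0)),  # an hour back
    Deployment("search-api", "bob", datetime(2025, 1, 1, 6, 0)),       # outside the window
]

if latency_regressed(baseline_ms=[120, 130, 125], current_ms=[480, 510, 495]):
    for d in suspect_deployments(deployments, incident_start):
        print(f"suspect: {d.service} deployed by {d.author} at {d.deployed_at}")
```

A real system would pull the latency samples and deployment history from its connected tools; here they are hard-coded so the sketch runs on its own.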
00:50:19
Imran
But what happens when it does not understand? Because see, it's still an AI, right? It can have, I don't know if brain fog is the right word, but a situation where it doesn't have sufficient data to act on. So maybe there is no chance to...
00:50:32
Priyank Upadhyay
It says that it didn't understand. It says that it doesn't understand,
00:50:36
Imran
Mm-hmm. Mm-hmm.
00:50:38
Priyank Upadhyay
and it backs that with evidence. For something it doesn't know, it will still go and create some hypotheses, and go and investigate. There is no person involved at this point, right? There are AI agents making hypotheses, running commands within your infrastructure, your systems and tooling, gathering evidence, and giving you a report with a lower confidence: this is what I did, this is what I found.
00:51:08
Priyank Upadhyay
I am not sure why this is happening. Now a human being will still go and fix things. But they already have a list of evidence that was collected before they even start.
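The behavior described here, reporting low confidence with evidence and leaving the fix to a human, can be sketched as a simple gate. This is a hedged sketch under stated assumptions: the `Report` fields, the 0.8 threshold and the three outcome names are hypothetical and are not taken from Rubik's Cube.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Report:
    # Hypothetical shape of an investigation report.
    hypothesis: str
    confidence: float                      # 0.0 to 1.0, as scored by the investigating agent
    evidence: List[str] = field(default_factory=list)
    proposed_action: Optional[str] = None


def next_step(report: Report, human_approved: bool, min_confidence: float = 0.8) -> str:
    """Decide what happens after an investigation.

    - "escalate": hand the evidence to a human; nothing is executed.
    - "await_approval": a fix is proposed but nobody has approved it yet.
    - "execute": a confident fix that a human has approved.
    """
    if report.proposed_action is None or report.confidence < min_confidence:
        return "escalate"
    if not human_approved:
        return "await_approval"
    return "execute"


unsure = Report(
    hypothesis="Connection pool exhaustion on the orders database",
    confidence=0.4,
    evidence=["p95 latency tripled at 12:05", "no deployment in the last 6 hours"],
)
print(next_step(unsure, human_approved=False))  # escalate
```

The point of the design is that a low-confidence report never reaches the execute branch, whatever the approval flag says, so the human starts from the collected evidence rather than from an automated change.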
00:51:22
Priyank Upadhyay
Now there could be newer context that they understand or know about. They will go and fix it. And Rubik's Cube can also understand how it was fixed, because it goes and validates.
00:51:34
Priyank Upadhyay
It assists you. It can even remember how things were fixed. Next time, it adds that as context.
00:51:43
Imran
Hmm. That means it will not go rogue and delete my database?
00:51:47
Priyank Upadhyay
Absolutely. Yeah.
00:51:49
Imran
Okay, I got a fair idea, I think. The flow you mentioned makes sense. It is the generally efficient way that humans in the tech industry have figured out to solve a problem. If a tool, an agent or an AI can mimic the same thing, then yes, there is a high chance it can figure out the real issues, rather than, I think, pushing a large amount of data through some kind of AI or just an LLM wrapper. Yeah, that makes sense. Yeah.
00:52:29
Imran
I think it will be very interesting. I have seen the earlier version, but frankly, in the last month you must have done a lot on this. I'm 100% sure. So I really want to see this in action, to traverse it from start to end and see the RCA. But that's a very nice way to do it rather than reinventing the wheel. You're just trying to mimic something which is already proven in the industry as the best way to solve the problem.
00:52:57
Imran
Yeah, it makes sense.
00:52:57
Priyank Upadhyay
And during the iterations, Imran, one thing that we have been seeing is that a lot of times, when there is an existing workflow and you try to mimic it and solve it, you see certain things which are either flawed or intended for humans.
00:53:01
Imran
Okay.
00:53:21
Priyank Upadhyay
Like for our case, I can say one thing is
00:53:21
Imran
okay
00:53:25
Priyank Upadhyay
SREs do need PagerDuty or some kind of alerting.
00:53:29
Imran
Alerting, yeah. Yeah.
00:53:32
Priyank Upadhyay
If it is an AI agent, it doesn't need PagerDuty per se. It needs some way to get alerted, but not a particular tool to do it.
00:53:40
Imran
Okay.
00:53:42
Priyank Upadhyay
Right? So it means eliminating these kinds of middle workarounds that were there because the use cases were different, right?
00:53:53
Priyank Upadhyay
A human being was the consumer, versus now an agentic AI is the consumer.
00:53:54
Imran
okay
00:53:58
Priyank Upadhyay
Things have changed. Now we do not rely on visual dashboards but rather on markdown files. Telemetry data can be consumed directly by an LLM, because LLMs are good at reading it. They are good at reading files.
00:54:11
Imran
Hmm. Hmm. Hmm.
00:54:13
Priyank Upadhyay
So it becomes much easier. Regarding progressive discovery, or disclosure, we are seeing much more efficient behavior when you do not give all the tools to the LLM.
00:54:27
Priyank Upadhyay
You tell it what you need, and then progressively it goes and identifies certain avenues or skills and then tries to do things.
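The idea of not handing every tool to the LLM up front can be sketched as a small selection step. This is a toy illustration under stated assumptions: the catalog, the tool names, the keyword lists and the substring matching are all hypothetical stand-ins for whatever skill discovery a real agent would use.

```python
# Hypothetical catalog: skill name -> tools that skill unlocks.
TOOL_CATALOG = {
    "deployments": ["list_recent_deployments", "rollback_deployment"],
    "database": ["run_readonly_query"],
    "metrics": ["query_dashboard_metric"],
}

# Hypothetical trigger words that suggest a skill is relevant to the task.
SKILL_KEYWORDS = {
    "deployments": ("deploy", "rollback", "release"),
    "database": ("query", "database", "db"),
    "metrics": ("latency", "traffic", "metric"),
}


def tools_for(task: str) -> list:
    """Return only the tools whose skill matches the task description."""
    text = task.lower()
    selected = []
    for skill, keywords in SKILL_KEYWORDS.items():
        # Plain substring matching keeps the sketch short; it is deliberately naive.
        if any(keyword in text for keyword in keywords):
            selected.extend(TOOL_CATALOG[skill])
    return selected


print(tools_for("API latency went up after the last deploy"))
```

For this task only the deployment and metrics tools are exposed; the database tool stays hidden until the task actually calls for it.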
00:54:38
Priyank Upadhyay
You can even automate the whole workflow of certain things. Let's say my typical workflow is like this. When an incident comes and somebody reports it to me, I first create a ticket in Linear.
00:54:53
Priyank Upadhyay
I'd find the right person to be assigned to that issue, or it could also be me. Then I pick up that ticket, go resolve it, and then close the ticket, right? There is a ticketing system involved now.
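The ticket process described above (create, assign, resolve, close) can be written down as a small state machine that an agent could be asked to follow. This is a minimal sketch with hypothetical names; it does not call Linear or any real ticketing API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    OPEN = "open"
    ASSIGNED = "assigned"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Each status may only move to the next one in the described process.
NEXT_STATUS = {
    Status.OPEN: Status.ASSIGNED,
    Status.ASSIGNED: Status.RESOLVED,
    Status.RESOLVED: Status.CLOSED,
}


@dataclass
class Ticket:
    title: str
    status: Status = Status.OPEN
    assignee: Optional[str] = None


def advance(ticket: Ticket, assignee: Optional[str] = None) -> Ticket:
    """Move the ticket one step forward; assigning requires an assignee."""
    if ticket.status not in NEXT_STATUS:
        raise ValueError("ticket is already closed")
    target = NEXT_STATUS[ticket.status]
    if target is Status.ASSIGNED:
        if not assignee:
            raise ValueError("an assignee is required")
        ticket.assignee = assignee
    ticket.status = target
    return ticket


ticket = Ticket("Checkout API returning 500s")   # somebody reports an incident
advance(ticket, assignee="priyank")              # find the right person
advance(ticket)                                  # pick it up and resolve it
advance(ticket)                                  # close the ticket
print(ticket.status.value, ticket.assignee)
```

Encoding the allowed transitions explicitly means a step cannot be skipped, which is the property a team would want if an agent is following their process.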
00:55:07
Imran
Yeah. Hmm.
00:55:08
Priyank Upadhyay
You can teach Rubik's Cube to follow this process as well. We have made it very programmable in that sense. A lot of companies follow a practice or a typical workflow, and they want Rubik's Cube to be personified.
00:55:24
Priyank Upadhyay
They want to use it as a real SRE. They can do it. So that's the approach we are taking. That is what is actually working.
00:55:38
Imran
Yeah, from the sound of it, and especially how you're building it... I'm pretty sure there is a tool hype right now. I think there are companies who are really trying to solve the problem, but I'm also seeing tools which are really not required. The tools I'm seeing right now which claim to be an SRE are just a bunch of wrappers on an LLM, and that type of tool is very dangerous also, because again,
00:56:14
Imran
I think the one-liner, the very important one which I mentioned, is putting AI in production. For this line alone there are any number of people in a company who want to verify and check: infosec, developers, CXOs too, whether it is really worth taking this much risk or not. But the problem is real. And again I'm telling you that AI can solve it, I'm 100% sure. I myself was trying a couple of years back; I told you there was one project, right? A lot of data is in Slack, and Slack doesn't give you that RAG system where you search and it gives you these particular details. But what if you have Confluence plus Slack plus different sources, and you create a RAG on top of a proprietary LLM or your own in-house LLM,
00:57:03
Imran
and have something, maybe not for the developer, but as an info book. Like maybe a support engineer trying to figure out, okay, how do I solve this issue? Or is there any information I have on this topic or not?
00:57:17
Imran
That was a very specific case I was trying to solve internally. With this much data, I could increase the efficiency of a support ticket by at least 50%, because that ticket must have been solved somewhere. There must be some kind of SOP written, or something already there as a runbook.
00:57:37
Imran
People may forget it, but what matters is knowing where it is, how to follow it, and what details you need. So this is the kind of problem people are trying to solve with AI. But AI SRE seems like a very complex thing, and the solution you're mentioning is also complex, and it needs to be complex, so that's fair. I got the idea. But do you think people want something like this in-house? And as I mentioned, like, okay...
00:58:04
Priyank Upadhyay
They are building it. Absolutely. There are AI budgets being allocated in companies to assign people to do it.
00:58:13
Imran
Oh yeah, I know.
00:58:16
Imran
Yeah.
00:58:17
Priyank Upadhyay
Adoption is not the problem. Recently, one of the challenges I am facing personally is build versus buy.
00:58:29
Imran
Hmm.
00:58:29
Priyank Upadhyay
And the convincing I have to do on why buying a tool would be a better fit is hard, because you cannot always give a proper defense. Personally, I'm also seeing a lot of efficiency in tooling like Claude Code and Cursor.
00:58:51
Priyank Upadhyay
You can automate a lot of things.
00:58:55
Priyank Upadhyay
So it is always a challenge right now, but to your point, adoption is not very difficult.
00:58:56
Imran
Hmm. Hmm.
00:59:03
Priyank Upadhyay
People are already doing it. Internal and in-house approaches have been there for, I think, the past two years, and now there is a plethora of options and companies coming into the same space that we are in and trying to help people out.
00:59:26
Imran
Yeah, totally makes sense.
00:59:29
Priyank Upadhyay
So for the next segment, one thing I wanted to move to: from an engineering perspective, as a person who has been an SRE, I know things, and you are also very aware of the pain points and everything.
00:59:46
Priyank Upadhyay
What happens beyond that? What's a non-operator's point of view to these issues?
00:59:53
Imran
Hmm, okay. You'd like to know how a non-engineering team reacts, right? Okay. I think you also...
01:00:04
Priyank Upadhyay
You have been a non-engineer for the last couple of years, right?
01:00:09
Imran
Yeah, yeah. I totally understand your question. My role is to be a bridge. I understand what you're trying to say, but the story is totally different on the other side.
01:00:22
Imran
You and I know that, from the perspective of the engineer, this is a nightmare kind of thing, right? But so it is for product, and so it is for the other teams also.
01:00:35
Imran
So there are people anxiously waiting in such scenarios, where there is an incident, or checking a report: okay, why is my system so unreliable? So I understand their pain also. They are anxious. They are seeing this number increasing or decreasing, or maybe going into very bad shape, but they don't know what to do about it. They don't have an answer or a solution. The answer to this type of concern is with us. So
01:01:08
Imran
sometimes, or rather most of the time, they probably don't understand that this is not something the engineering team wants. This is something the engineering team fights every day. This is a situation we also don't want to be in. But we cannot fix this completely. It will be there. The fix is to make everything around the application more reliable.
01:01:40
Imran
The infrastructure itself needs to be very reliable. So yes, most of the time non-engineering people don't understand. And sometimes people like me, people like us in the industry, try to give a simpler version of what exactly is going on in the background. We say, hey, this is what exactly is going on.
01:02:03
Imran
And this is the reason why it is taking time, or this is the reason why it happened. But it is not a happy face. It has never been a happy face. And believe me, for people who are very much from the IC side or the tech side, it's difficult to translate those technical things for people who are non-technical. So yeah, these are problems too. They are right in their own position, because they are the owners of the product. They don't want shitty things happening every day or every night.
01:02:37
Imran
So I will say I have been part of this, and now that I'm coming to this side I also understand why it is important, but it's very difficult to translate that language.
01:02:48
Imran
And I think every person must be facing the same thing. Yeah.
01:02:53
Priyank Upadhyay
Makes sense. And I think one of the best things for a non-operator is that they can think objectively about the end goals, the outcomes that they want.
01:03:07
Imran
Yeah.
01:03:08
Priyank Upadhyay
And me as an engineer, or somebody as an SRE, would only have a very narrow or specific view of the problem and the engineering side of things.
01:03:16
Imran
Yes.
01:03:18
Priyank Upadhyay
So the business objectives are something that we also need to understand. They also need to be translated for us. I think your role is actually that bridge, right?
01:03:27
Imran
Yes, exactly. You're right. A lot of the time, and I mean both sides, not just one, people don't understand severity. When we say severity, maybe it's a very small thing for an engineer, but impacting a single customer can be huge for a business, right?
01:03:44
Imran
So how will you know about a customer whose annual payments make up more than 50% of the business? There are a lot of cases like that. So it's important to understand that and bridge the gap.
01:03:58
Imran
It's there, and it's not new. It's from very old times. That's why I think roles like mine exist, and why a lot of product people are engineers who converted to product. Their roles are very important, because this bridge is very important for helping both sides understand each other. So I see it this way.
01:04:20
Imran
Engineers are always like, I will build the best, most reliable thing. From day one this should be there, 100% foolproof, this-proof, that-proof. And product is like, do it as soon as possible. Someone is waiting, we can improve all this later. So see, this is it.
01:04:33
Imran
Both are right. Whom are you going to blame? It's always the trade-off which matters, right? And sometimes this trade-off converts to tech debt, and we see that issue in reliability. So it's always a race, man. And I'm not saying someone is right or wrong. It comes down to what you are giving priority to, and the decision we have taken is what matters at that moment.
01:04:59
Priyank Upadhyay
Makes sense. One more question I do have regarding the non-operator side of things.
01:05:06
Imran
Yep.
01:05:09
Imran
Mm-hmm.
01:05:11
Priyank Upadhyay
Even in your role, Imran, as an engineering manager: maybe you assign an on-call person or an SRE some issue, something that broke, or they even figure it out themselves.
01:05:27
Priyank Upadhyay
Would you like to know what is happening?
01:05:32
Imran
Yes. Frankly, when something like this happens, I cannot be sitting idle thinking, let them fix it, while I enjoy my coffee or tea.
01:05:45
Imran
I really want to know at that moment. I want to know everything. I don't want to disturb them and ping them: hey, tell me first and then go fix it. But yes, I want to know, because maybe they are missing the impact, or something on the priority side which needs to be given importance.
01:06:06
Imran
That is the first factor. That's why I want to know where we really have to focus first, from the business perspective. But apart from that, I don't know, maybe I am still an engineer. So I really want to know what's going on.
01:06:20
Imran
So yes, I like to know. I want to know in this situation. I always want to be active in this. But obviously, I don't want to disturb the folks who are already working on it.
01:06:28
Priyank Upadhyay
See, the reason I asked is that one of the ways I pitch Rubik's Cube, and I'm not pitching it here, I'm just talking about it, is that I'd like it to be an invisible product,
01:06:29
Imran
But yes.
01:06:31
Imran
Yeah.
01:06:38
Imran
Mm-hmm.
01:06:47
Imran
Yeah. OK.
01:06:48
Priyank Upadhyay
wherein you set it up once and it compounds its knowledge and understanding, fixing things or even giving you RCA reports, and coming to you only when you are needed. That's all.
01:07:00
Priyank Upadhyay
So for an operator, a user who is going to use Rubik's Cube, most of the time it is going to be invisible, and you only see the value when you notice your time is being saved and you're getting a better understanding.
01:07:13
Imran
Yes.
01:07:16
Priyank Upadhyay
But there is another aspect to it. Since we already know what is happening, why things are breaking, and how things are getting resolved, this translation, the gap that you mentioned, is much easier to convey to a non-operator who wants to understand what is happening, why things are flawed, or, let's say for the whole last month, what the overall cost of things is or why things are going this route.
01:07:44
Imran
Oh, yeah.
01:07:47
Priyank Upadhyay
This is the kind of information where you do not want to disturb an on-call engineer but still want to know about it. This is something Rubik's Cube can also help with.
01:07:57
Imran
Exactly. I can tell you, the industry follows weekly, sometimes daily, reports which accumulate into monthly ones, and that's called...
01:08:00
Priyank Upadhyay
and
01:08:08
Imran
There are a couple of terms I can tell you: product quality and engineering quality. One of the key factors is this data, because how do you know how your product is doing? This data gives you the exact idea of how your team is keeping up, how your team looks, and how your product looks. So yes, this data is important, and a lot of the time we also circle back to the engineers to give us the data, which again is not a very pleasant thing to do. We also know that for them, solving the problem is the only important thing, not the rest of this bullshit. So yes, this data is important for the business, but maybe not for the engineers who are working on it. I agree, yeah.
01:08:51
Priyank Upadhyay
Makes sense. One more, and I think it is the last segment.
01:09:00
Priyank Upadhyay
There are a lot of geopolitical disturbances happening.
01:09:05
Imran
Yeah.
01:09:06
Priyank Upadhyay
How do you think it is going to impact companies' business? And maybe touch upon the hiring aspect, because one of the concerns I'm seeing is that a lot of headcount reduction is happening in various organizations.
01:09:33
Priyank Upadhyay
How should we feel about it? What should be done?
01:09:38
Imran
See, no doubt, when the AI boom happened, I think in our sector specifically the biggest impact from AI right now is in writing code, and the efficiency is so much better now. And I come from the era when the first release came from the, uh...
01:10:04
Imran
Sorry, can you hear me? Yeah. So remember when Copilot by GitHub came out? I have seen AI writing code from that era to now, where it is very efficient.
01:10:08
Priyank Upadhyay
Yeah.
01:10:19
Imran
The efficiency is so high that a lot of companies, big names and small names, are replacing people with AI. Or maybe I can say they want people who have the ability to produce more impact with the help of AI, not just manually typing code.
01:10:38
Imran
Those are two different things: enabling people themselves to produce more, rather than just writing code the old way. So,
01:10:50
Imran
with the whole geopolitical situation and the things happening right now, people are pulling back and saving their money. So the first cause is that every company is trying to save money, right, because oil prices are high. I don't want to go very political on this, but understand it this way: they are trying to cut costs, and right now they see AI as something that will save some cost, because people costs are higher and token costs are lower. But, and maybe this is another controversy I don't want to get into, I don't see that this is efficient. Obviously, you can say this is correcting the market, where a project that doesn't require 50 people can be done with 20. That makes perfect sense.
01:11:31
Imran
But it doesn't mean you can get it done with three people. Again, the people will start burning out and your token costs will skyrocket. Right now it is efficient because tokens are priced low, or I can say they are not really priced at what they should be.
01:11:45
Priyank Upadhyay
Yeah.
01:11:46
Imran
Yes, the moment it comes to the corrected market price, I think companies will understand the value of people again. So it will be there. It will not go away, but the way people are treating it, replacing, wiping out the whole team, I think that is something we will probably see much less of in the future.
01:12:07
Imran
Now, how I see a tool like this, what we are doing right now, especially in production... Coding is a pre-production thing, right? That is where dev productivity is required, and...
01:12:23
Imran
Sorry, dev productivity, and the way it's happening, you can generate more productivity there. But again, the whole thing converts into faster deployment cycles, and a faster deployment cycle means a better product or faster feature releases, at any company, I'm telling you. So a tool in this space, where post-production or post-release comes in, I think is also going to add to that. You're building without giving a second thought to: is it right, is it going to fail, what's going to happen? So first, right now you have a solution where you can write code fast. You can develop fast.
01:13:07
Imran
Now, I think in six months, or maybe very soon, in two or three months, you will see people also not worrying about what is going to happen post-production. Both sides are balanced by AI tools, and the real engineers are really focusing on building the product and making it reliable.
01:13:22
Imran
So this is how I'm seeing it. So yeah, I think a lot of companies are sacking people, but a little bit of correction needs to be done. That's my personal opinion, but it's okay. It always happens like this. The industry has been
01:13:37
Imran
jumping on some of the news very fast, and then it corrects itself. So I think it will correct itself in some time.
01:13:43
Priyank Upadhyay
Got it. That is exactly what I am thinking too. Regarding AI growth and adoption, people are moving towards it more efficiently than ever.
01:14:01
Imran
Yeah, yeah.
01:14:03
Priyank Upadhyay
But also there is the drawback of the hype that is easily being circulated.
01:14:09
Imran
Yeah.
01:14:10
Priyank Upadhyay
So one challenge that I often see is: how do I differentiate between signal and noise?
01:14:18
Imran
and
01:14:20
Priyank Upadhyay
So personally, when you talk to ChatGPT and ask anything, it will give you something. Or you give it an idea and just ask it to validate it, and it will give you fair points.
01:14:33
Priyank Upadhyay
You ask it to
01:14:36
Priyank Upadhyay
tune it towards your perspective, and it will do that. Go and ask it to argue against it, and it will also do that. That is my problem.
01:14:47
Priyank Upadhyay
So you never know what is correct and what is incorrect, and that judgment is diminishing a lot nowadays.
01:14:49
Imran
That makes sense.
01:14:57
Priyank Upadhyay
You know, there is some falsehood and misrepresentation of things. So identifying that signal versus noise is getting more and more difficult.
01:15:08
Priyank Upadhyay
A lot of AI slop is flooding the market.
01:15:09
Imran
Agreed.
01:15:11
Priyank Upadhyay
So that also is there.
01:15:12
Imran
Yeah, yeah.
01:15:14
Priyank Upadhyay
Awesome.
01:15:14
Imran
So, yeah.
01:15:17
Priyank Upadhyay
Imran, anything else you want me to touch on?
01:15:21
Imran
I think for today it is good. I'd like to read more about SRI. That concept seems interesting to me. And I'm pretty sure more people like me from the industry would like to know about this, rather than this AISRI hype.
01:15:39
Imran
And I told you that.
01:15:40
Priyank Upadhyay
I'll drop in some links in the video.
01:15:42
Imran
Yeah, because this seems interesting. Again, I'm telling you, I'm not very fond of this AISRI thing you mentioned; I think Gartner named it. I'm not very fond of it, but it seems like this SRI
01:15:55
Imran
is something where we can see how you're trying to solve the whole problem from this angle, from this category. So it will be very interesting to read. Yeah.
01:16:06
Priyank Upadhyay
Awesome. And Imran, any suggestions or message actually for our viewers since we kicked off our first podcast?
01:16:19
Priyank Upadhyay
Anything that you want to talk about?
01:16:22
Imran
Listen to these people. This team is great, especially Priyank. If you want to know what is fresh in the industry, first of all follow him; you will get very new, fresh information. And to the viewers watching this: I think people shouldn't be shy to try tools. That is the one thing I'll say, because among people from my generation or the older generation I'm seeing a little resistance. But I'm not saying to have blind faith, like, okay, no humans required, we can replace everything with AI or robots.
01:17:06
Imran
But there are some beautiful things going on. People should try tools. People should see whether there is something the tools can give them, whether they can get something new, something valuable, out of them.
01:17:17
Imran
A lot of juniors coming out of college right now have this capacity built in: there's a new tool, they try it. They integrate these tools, or AI, into their day-to-day life.
01:17:30
Imran
For people like us, who have been in the industry for 10 or 15 years, I have just a couple of suggestions. One is that we should also try something new. We should also see what's going on in the industry with AI, in its different aspects, because it is taking shape very fast. You cannot keep up with it if you're not looking; look away for a minute and something has already happened. Every week we are seeing a new model, and there is a huge difference from the last one.
01:18:01
Imran
So we have to keep up with it, and we have to see what exactly is happening. This is the first thing I really want to say to everyone: whether we are in college or an industry leader, we should know what's going on around us, especially in AI. This is the one thing I really want to say to the viewers.
01:18:22
Priyank Upadhyay
Thanks a lot, Imran, for taking the time to talk to me.
01:18:24
Imran
Yeah.