Integrating Networks and Servers for Better Management
00:00:00
Speaker
Making a network static and only scaling your server doesn't work anymore. You need to do both together. The biggest transformation that happened was SDN. Now, the SDN that was sold is not the SDN of reality. When I'm doing things in cloud and infrastructure, you have to merge the two. So we are slowly borrowing things from cloud and applying them at the infrastructure layer as well. Oh, it's so difficult to do this via models. Maybe it's something that AI might be able to do: looking at logs coming in from various places, cross-referencing them and quickly figuring it out. Whereas what a human has to
Transformative Power of AI in Network Management
00:00:40
Speaker
do is a lot of manual work.
00:00:53
Speaker
Welcome to a new episode of Observability Talk. Every application we use is delivered through an IP network, so the availability, health and performance of this network becomes one of the fundamental needs of the IT ecosystem. All of us
Contributions of Dhruv Dhody in Internet Standards
00:01:10
Speaker
practitioners always love to talk about the application layer, how it is being developed, how it is being deployed, but mostly do not worry about the network component at all. Today, I am glad to have Dhruv Dhody on Observability Talk. Dhruv is a principal engineer at Huawei.
00:01:28
Speaker
Dhruv works on Internet standards, mainly at the IETF, where he chairs the Path Computation Element (PCE) and SRv6 Ops working groups and the Education and Outreach Directorate. Dhruv is also a member of the Internet Architecture Board.
00:01:44
Speaker
I have known Dhruv for the last few years and have had the honor of co-founding IIESoc,
India's Participation in IETF and Future Prospects
00:01:50
Speaker
the India Internet Engineering Society. It was created to build awareness about the IETF and standardization and to increase participation in the IETF from India. One of the main aims of IIESoc is to bring an IETF meeting to India in the near future.
00:02:13
Speaker
Hi Dhruv. Welcome to Observability Talk. How are you doing today? Very good. Nice to be here.
Dynamic Networks and Digital Transformation
00:02:22
Speaker
Yeah, Dhruv. So in general, across enterprises, with digital transformation at scale, applications are embracing cloud and mobile, right? And that is becoming the new normal. From a network standpoint, what innovation and transformation have you seen happening on the networking side?
00:02:42
Speaker
Can you talk a little bit about what sort of new things are happening which will really help this digital transformation that all enterprises are taking up now? Yeah, so the same thing which is impacting the IT and cloud world, where they have to move fast, has impacted the network as well. So far, networks were just static. We used to buy the devices once, configure them, put them in place and then forget about the network completely. That has changed now because we need this kind of dynamicity.
00:03:14
Speaker
Making a network static and only scaling your server doesn't work anymore. You need to do both together.
Role of SDN in Modern Networking
00:03:21
Speaker
Now, if you are making your network dynamic, then you need better network automation, network observability and monitoring. All the dynamicity that you added in the cloud world, you need to start supporting on the networking side as well. The biggest transformation that happened was SDN.
00:03:40
Speaker
Now, the SDN that was sold is not the SDN of reality. We did realize that the holy grail, where the devices would be completely dumb and everything would happen on the controller, did not happen.
00:03:56
Speaker
But the concept of having central control of your network did: a place where you can orchestrate things, a place where you can do innovative things, while your devices still do the fast things, like switching traffic very quickly or running your control plane protocols. Having a little more programmability, faster configuration,
Challenges in Cloud and Infrastructure Integration
00:04:20
Speaker
more agility in your network, that came in with SDN. Some of the use cases we had to satisfy were: oh, 5G is coming, or everything is moving to cloud. So you need a little more ability to scale your network up and down. SDN comes in and helps with all of that as well. Then there is the idea of edge computing because of 5G. Again, if you are doing something like that, you need to configure your network in a much more dynamic way, and the same concepts come in. And once you have the concepts for cloud, it's difficult to say: oh, I will run my cloud networking very differently from my infrastructure-level networking.
00:05:03
Speaker
You can't end up differentiating, saying these are my physical devices and these are my virtual devices, and I will have a different network management paradigm for cloud and for infrastructure. You have to merge the two. So we are slowly borrowing things from cloud and applying them at the
Cross-referencing and Observability in Networks
00:05:20
Speaker
infrastructure layer as well. And all your public cloud service providers are providing network as a service. So those kinds of concepts are now bleeding in
00:05:30
Speaker
across ISPs, across cloud, across data centers, and slowly we are moving towards this transformation of looking at the network in the same way we look at cloud workloads and microservices.
00:05:45
Speaker
This is a very interesting time for networking also, like we were discussing. Things have been changing so fast, at least from an application architecture perspective. Networking, though, still remains switches, routers, firewalls and so on.
00:06:03
Speaker
But the way they are being used today is quite different. And with cloud coming into the picture, what sort of challenges do you see when we are combining virtual network devices with physical network devices, or when you are extending your data center to the cloud? What sort of challenges do you expect people to tackle?
00:06:28
Speaker
So the biggest challenge, I think, is that we need a good way of cross-referencing and mapping things. When you are trying to observe something on a physical device, which might have multiple network functions or multiple servers running, each producing its own data, how do you correlate data across these layers? That is where the problem is, because the whole reason we do network observability is to be able to infer good information from it.
00:06:58
Speaker
Now what we need to do is have a
AI in Automating Network Troubleshooting
00:07:00
Speaker
way of layering information. It can't just be: let's dump the information and not correlate between the pieces. That's a challenge that even the IETF, where I'm quite active, has been dealing with. Yes, we developed all these models, and there is data coming in from your interfaces, from your routing protocols, from your infrastructure, from your performance measurements. But how do you combine them? How can you actually infer something from this raw stream that we might be getting from our devices? That's challenging, and sometimes it's even difficult to standardize. But some early work has started happening on how we would cross-reference things. For instance, if there is an inventory data model, how does that correlate with the routing data model?
00:07:46
Speaker
How do you say which identifier in my inventory gets used to identify the same thing in a different database? This kind of cross-linking is very important, and many people are working on innovative ways to do it. In fact, now that we talk about AI, some people are imagining: oh, it's so difficult to do this via models, maybe it's something that AI might be able to do, looking at logs coming in from various places, cross-referencing them and quickly figuring it out, whereas what a human has to do is
Blame Dynamics in Network and Application Performance
00:08:20
Speaker
a lot of manual work. When we troubleshoot networks, we open logs from multiple places and try to track between them. We have done those things manually, but can that be
00:08:31
Speaker
done in a much more automated way? I will give you a view: some part of this very similar problem statement we have solved for applications, in the business journey part. That is, how you correlate a transaction ID which is available in one application, which then gets mapped to another transaction ID in another application, and then take those logs and correlate them. We'll talk about that when we talk about syslog.
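As a rough illustration of the cross-referencing discussed here, the sketch below links records from an inventory dataset to records from a routing dataset. It is only a minimal sketch under stated assumptions: the record fields, the sample identifiers and the choice of hostname as the shared key are all hypothetical, and real inventory or routing models would need an agreed identifier mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InventoryItem:
    serial: str      # identifier used by the (hypothetical) inventory system
    hostname: str
    site: str


@dataclass(frozen=True)
class RoutingNode:
    router_id: str   # identifier used by the (hypothetical) routing system
    hostname: str
    bgp_peers_up: int


def cross_reference(inventory, routing):
    """Join both datasets on hostname, the shared key assumed for this sketch.

    Returns (linked, unmatched): linked records carry identifiers from both
    sides; unmatched lists router IDs with no inventory counterpart.
    """
    by_host = {item.hostname.lower(): item for item in inventory}
    linked, unmatched = [], []
    for node in routing:
        item = by_host.get(node.hostname.lower())
        if item is None:
            unmatched.append(node.router_id)
        else:
            linked.append({
                "serial": item.serial,
                "site": item.site,
                "router_id": node.router_id,
                "bgp_peers_up": node.bgp_peers_up,
            })
    return linked, unmatched
```

The unmatched list is the useful part in practice: it shows where the two systems do not agree on how a device is named.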
00:08:58
Speaker
This is very interesting, right, that you talked about correlation across things. In fact, even today when we talk to a lot of our customers, they have their own
Full Stack Observability for Accurate Issue Pinpointing
00:09:10
Speaker
network team, NOC team and so on, and they keep cribbing that they are always the first to be blamed when an application is not working. Because somehow people see it as magic: I have my application on a server and somehow it is getting delivered to the end customer. So what we hear from some of these network teams is that whenever there is higher latency or slowness, people generally start by blaming the network, saying that the network is slow, or the connectivity is slow,
00:09:41
Speaker
or there are packet drops happening. You talked about the correlation part. One other thing which we generally talk about is that the network is seen as a shared medium. Multiple applications and multiple other things are being delivered on the same network. Now, how do you see some of this correlation happening between application slowness and network monitoring data? People have started calling
Telemetry Advancements by IETF
00:10:10
Speaker
it something like full-stack observability: I am giving you a view into all the layers of how your application is being delivered. So do you think that doing some of this correlation will help identify when it is a network problem and when it is not?
00:10:27
Speaker
Yeah, most definitely. Thinking of applications and networks in silos is not going to work, especially in the cloud world where they are so deeply intertwined. It used to work in the ISP world a little bit because there is a very clear hand-off: you have a physical infrastructure change when you send a packet out of your data center into an ISP or a WAN. So there it's much easier for us to have clear boundaries. But within your data center, unless you are doing full-stack observability where you are correlating how my applications are working with
00:11:06
Speaker
how my network is working, and not just at the physical layer or the infrastructure layer but across all the different dynamics involved, it is not going to work. And it's not just one data center anymore either. You have multiple sites, you might be deploying SD-WAN, and you don't really know at the start where your application is running. So unless you are really viewing all of these things together, it's very difficult to pinpoint things. Otherwise you get this whole thing of: oh, I'm in a silo, so I'm going to say it's not my problem and just throw the problem over to the server team or to the network team. And most of the time it is the network folks who lack the visibility.
00:11:48
Speaker
Server folks have very good data to prove that, look, my side is working super fine, and the network guys say: oh, I don't have good visibility, so it must be my problem. That happens. But to really solve the problem you need full visibility everywhere, and there have been a lot of good innovations at the network layer as well. In the IETF we are working on something called IOAM, where you start carrying telemetry within the packet itself, so I can pick out the exact flow that I need to measure.
Real-Time Data with Streaming Telemetry
00:12:20
Speaker
That flow then carries telemetry information alongside the packet. So you are no longer doing it the old-school way of polling and probing and looking only at the metrics that are currently being delivered; you can in fact force new data collection in the packet itself to really figure out where the problem is. This kind of innovation could also be borrowed
00:12:44
Speaker
in the cloud world, in the data center world, for us to trace paths across our workloads and figure out where the bottleneck is: whether it's a switch problem, whether it is an application problem, whether it is a particular flow itself. And sometimes what happens is,
00:13:01
Speaker
there is nothing wrong with you, but something happens totally out of the blue. Suddenly a backup starts and that backup eats up everything. And if I am tracing only my part and looking only at my view of where I thought the problem was, without looking at the whole system, I'm going to miss the root cause. For that you need full-stack observability. Right.
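To make the idea of telemetry carried with the packet a little more concrete, here is a minimal sketch of how per-hop data collected along a path could be used to locate a bottleneck. It is a simplified model, not the IOAM wire format: the `(node_name, timestamp_ms)` record shape, the node names and the function name are assumptions made for illustration.

```python
def slowest_hop(hop_records):
    """Find the path segment with the largest delay.

    hop_records: list of (node_name, timestamp_ms) in path order, standing in
    for trace data that each node appends as the packet passes through it.
    Returns (from_node, to_node, delay_ms) for the slowest segment.
    """
    if len(hop_records) < 2:
        raise ValueError("need at least two hop records to measure a segment")
    worst = None
    for (node_a, t_a), (node_b, t_b) in zip(hop_records, hop_records[1:]):
        delay = t_b - t_a
        if worst is None or delay > worst[2]:
            worst = (node_a, node_b, delay)
    return worst


# Hypothetical trace for one flow: most of the delay sits between
# spine-2 and leaf-7.
trace = [("leaf-1", 0.0), ("spine-2", 0.4), ("leaf-7", 9.1), ("server", 9.3)]
print(slowest_hop(trace))
```

The point of the sketch is that the measurement follows the exact flow under investigation, rather than relying on counters polled separately from each device.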
00:13:22
Speaker
Now, that's very, very true. Just extending this discussion to another part: you talked about telemetry information being available within the traffic packet itself. This is something which I see as one of the biggest challenges from the network monitoring perspective. We are very reactive.
00:13:43
Speaker
We have been using SNMP and polling to collect health and performance data. I was talking to one of our customers, who was a NOC leader, and he was telling me that whenever an application goes down,
00:14:06
Speaker
the application performance monitoring tool will have a graph which is going along and then shows a sudden drop to zero, because obviously there are no transactions or requests reaching the application.
00:14:21
Speaker
But for him in the network, if there is a network endpoint which has gone bad or something has gone wrong, he will come to know only after the 3 or 5 minutes of the polling interval. So he was telling me that his CEO used to call him as soon as the APM showed that requests had gone down and say: no, no, there is a network problem because of which all my requests are going down. And this person doesn't have a clue, because the SNMP poll is yet to happen.
00:14:50
Speaker
So, some time back when we were doing some research, we realized that for this, people came out with something called streaming telemetry, where rather than polling the devices, the devices proactively send this data every 30 seconds or 15 seconds and so on, so you become much more proactive.
00:15:09
Speaker
Have you seen enterprises, data centers or ISPs starting to use streaming telemetry more often now, at least for getting better visibility into how the network is performing?
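The gap described here is simple arithmetic, sketched below for the intervals mentioned in the conversation. The two helper functions are hypothetical, and the average case assumes failures occur uniformly at random within an interval.

```python
def worst_case_detection_delay(interval_s):
    """A failure right after a poll or sample is only seen at the next one."""
    return interval_s


def average_detection_delay(interval_s):
    """With failures uniformly distributed in the interval, half of it on average."""
    return interval_s / 2


for label, interval_s in [
    ("SNMP poll every 5 min", 300),
    ("SNMP poll every 3 min", 180),
    ("streaming every 30 s", 30),
    ("streaming every 15 s", 15),
]:
    print(
        f"{label}: worst case {worst_case_detection_delay(interval_s)} s, "
        f"average {average_detection_delay(interval_s)} s"
    )
```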
00:15:24
Speaker
Yeah, so within the cloud world, we see this a lot. Streaming telemetry is used especially in enterprises which are running their own big services, where you do need to monitor things continuously. Google and Netflix did some of the major work that happened in streaming telemetry. And at least in this sense, they are an enterprise. They might be a very large-scale enterprise, but they are a private network looking at both the server side and the network side. They are the ones who innovated the idea of gRPC and gNMI and those kinds of streaming mechanisms, which the IETF has also standardized with the YANG work. So you don't have to use proprietary things. You can use IETF YANG models, and those models can still clearly identify:
00:16:17
Speaker
I am looking for a change in this parameter. So it is not even a periodic push: the moment it changes, the device immediately sends a notification with the information. You get it in real time, not every 30 or 50 seconds when a timer fires. It's on change.
00:16:38
Speaker
That on-change mode is especially for cases where we really need to react quickly. We know there are cases, especially in finance and similar areas, where every second matters, and those are the people who are definitely moving towards streaming telemetry.
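A minimal sketch of the on-change behaviour described here is shown below. It does not use gNMI or any real telemetry library; the class name, the path string and the callback signature are assumptions made for illustration. The subscriber is notified only when an observed value differs from the last one seen, rather than on every sample.

```python
_MISSING = object()  # sentinel: no value has been seen yet for a path


class OnChangeSubscription:
    """Notify a callback only when a watched value changes (illustrative only)."""

    def __init__(self, callback):
        self._callback = callback
        self._last = {}

    def observe(self, path, value):
        """Feed one observation; returns True if a notification was sent."""
        previous = self._last.get(path, _MISSING)
        if previous is _MISSING or previous != value:
            self._last[path] = value
            self._callback(path, value)
            return True
        return False


notifications = []
subscription = OnChangeSubscription(lambda path, value: notifications.append((path, value)))
# Five observations of a hypothetical interface state, three actual changes.
for state in ["UP", "UP", "DOWN", "DOWN", "UP"]:
    subscription.observe("interfaces/eth0/oper-status", state)
print(notifications)
```

Compared with sampling on a timer, the receiver sees fewer messages and sees each change as soon as it is observed.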
00:16:57
Speaker
Now, in traditional enterprises it's been a little bit slow, but it is growing more and more as and when they move towards cloud. Cloud has been the bandwagon on which networks have changed. And the mindset has changed, because earlier the network engineer's job was very different: oh, I will configure a device, I have these scripts, I run this, I have a way of doing things which has not
Network Automation Potential and Challenges
00:17:20
Speaker
changed since the start of the networking.
00:17:23
Speaker
But now even the young people who are coming into network engineering have Ansible; they have a good understanding of new tools. They are in fact programmers who have good scripting and network programming knowledge. They think: the same way I'm running my servers, I want to run my network. And that mindset change is where streaming telemetry is coming in. Another thing is the scale.
00:17:51
Speaker
Now, as and when your scale increases, relying on traditional SNMP does not work. You need streaming telemetry to react quickly, because at scale the delay I face when reacting to something just multiplies. One more thing that comes into the picture here is: why am I actually doing this monitoring? It's so that I can find anomalies, so that I can find where the problem is. SNMP MIBs have data at a granularity which worked in the past but doesn't work anymore. And in fact, look at the IETF. The IETF is not writing MIB
Evolution of Syslog in Network Management
00:18:40
Speaker
modules. New protocols are not writing MIB modules. So the world has moved on. If you're still stuck with SNMP, you're losing out on all the new things
00:18:49
Speaker
that have been done in the networking world. You're losing out on fine-grained metric collection and on the ability to define what is most important for you, rather than relying on a standard MIB module that was written by the vendor, has not changed in the last 20 years, and still gives you device-level information where you need granularity. That granularity will come if you have the ability to program: at this moment of time, I want this information.
00:19:17
Speaker
I care about this information, give me now. Maybe things have changed. I want to have a way of saying that I don't care about this information anymore. So I don't have to just get a stream of data unnecessarily. Only when I require, I should be able to switch it on, switch it off, change the granularity.
00:19:34
Speaker
and bring that mindset into my networking, then I will have a good view of why I need this real-time data. People say it just for the sake of it: oh yeah, the data will be fast. But the fast data and the ability to use it to solve a problem I was not able to solve before need to happen together. No, so three things which I take out from this. One is obviously very loud and very clear. When I started working with the IETF, my first RFC was on a MIB. I have always had this question of why we are making it so difficult for everyone to get the right metrics quickly and use them for whatever insights we want to bring out. So bang on that. I also believe that SNMP is not the way forward. And I believe that streaming telemetry or, like you said,
00:20:29
Speaker
the device or the network telling you what the problem is, when you have configured it correctly, and giving you insight into what actions you should take now, is where I also believe the whole industry should be going, because this is so critical. If you look at it, the network is always seen as: okay, yeah, it is available, it is there. And people have generally not really got very
IPv6 Adoption Driven by Cloud and IoT
00:21:00
Speaker
deep products that give you deep visibility into what is really happening. And I will connect it to the next thing, which again is very critical from what we have seen but is not being used correctly: IP flow.
00:21:16
Speaker
Now, IP flow we have seen a lot of our customers using, but the configuration is either incorrect, or it is not enabled in the right place where you can really use the visibility, or it is enabled on all interfaces for both ingress and egress, so it gives you double the data, and so on.
00:21:41
Speaker
So in your experience, have you seen IP flow being used as part of network observability, and are people really able to use it to get better insights into how their network is being used?
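The double counting mentioned here can be sketched as follows: when flow export is enabled in both directions on every interface, a flow transiting one device is reported once on the way in and once on the way out. This is a minimal sketch under stated assumptions: the `FlowRecord` fields and sample values are hypothetical, and preferring ingress records is just one possible de-duplication policy.

```python
from collections import defaultdict, namedtuple

FlowRecord = namedtuple(
    "FlowRecord", "exporter src dst sport dport proto direction bytes"
)


def bytes_per_flow(records):
    """Count each flow once per exporter.

    Ingress records are preferred; egress records are used only for flows
    that have no ingress record on that exporter.
    """
    ingress, egress = defaultdict(int), defaultdict(int)
    for r in records:
        key = (r.exporter, r.src, r.dst, r.sport, r.dport, r.proto)
        bucket = ingress if r.direction == "ingress" else egress
        bucket[key] += r.bytes
    result = dict(egress)
    result.update(ingress)  # the ingress count wins where both exist
    return result


records = [
    FlowRecord("r1", "10.0.0.1", "10.0.0.2", 1234, 443, 6, "ingress", 1000),
    FlowRecord("r1", "10.0.0.1", "10.0.0.2", 1234, 443, 6, "egress", 1000),
    FlowRecord("r1", "10.0.0.9", "10.0.0.2", 5555, 443, 6, "egress", 500),
]
print(sum(r.bytes for r in records))          # naive total counts the first flow twice
print(sum(bytes_per_flow(records).values()))  # de-duplicated total
```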
00:21:57
Speaker
So there are two things there. In the ISP world it is very common, because that's the only source of information on what my big flows are, what my small flows are, what is video, and so on. All this information is critical for an ISP to figure out how its network is being used.
00:22:16
Speaker
Now what happens is, in the cloud world or in the data center
AI's Role in Network Anomaly Detection and Management
00:22:20
Speaker
world, some people say: if I'm getting that information from the server itself and I can see what flows are being generated, why do I need to bother? But if you rely only on looking at data at one layer and do not correlate it with the other layer, how are you going to find your problems? How are you going to know what is going wrong? If there is an intrusion in my system and I only look where I have visibility, just switching the light off when it comes to the network and not bothering, then how am I going to detect when something is going wrong? The only way to detect it is by monitoring, by continuously being aware of what is happening.
00:23:03
Speaker
Are my critical applications the ones using my network the most, or are there some random flows? Somebody is doing a backup somewhere which is eating up everything in my network, somebody is browsing, somebody is doing a video transfer; even logs and disk images sometimes have such a huge size, and all of this takes up your network. So you also need to look at the network on its own: forget the application for a minute and have visibility into your network via IP flow to figure out how your network is being used and who the main users are, which can give you a lot more information. There are very good tools available now. You don't have to go and write them or invest in something. There are open source tools that you can simply rely on that will show you in pretty graphs
00:23:53
Speaker
how your networks are being used, which are the elephant flows, which are the small flows, where the spikes are, what the patterns are. You need to look at the IP flow itself, then correlate it with your application. When the correlation does not match, that's how you find where the problems are in your network: either you are looking at something incorrectly, or something is configured incorrectly, or the applications are not behaving the way you imagined from looking only at the application layer.
00:24:22
Speaker
Right. So IP flow has a lot of good use cases that we need to start using more in the cloud world as well, rather than relying only on looking at things at the application layer.
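As a small illustration of looking at flows on their own, the sketch below totals bytes per source and destination pair and reports the pairs that account for a large share of all traffic. The record shape, the sample endpoints and the 50% threshold are assumptions for illustration; real flow collectors expose far richer records.

```python
from collections import Counter


def top_talkers(records, share_threshold=0.5):
    """Return [(pair, bytes, share)] for pairs at or above the traffic share.

    records: iterable of (src, dst, nbytes) tuples, a simplified stand-in
    for exported flow records.
    """
    totals = Counter()
    for src, dst, nbytes in records:
        totals[(src, dst)] += nbytes
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return [
        (pair, nbytes, nbytes / grand_total)
        for pair, nbytes in totals.most_common()
        if nbytes / grand_total >= share_threshold
    ]


# Hypothetical sample: the backup dominates while the application is small.
sample = [
    ("app", "db", 200),
    ("backup", "nas", 9000),
    ("app", "db", 300),
    ("user", "web", 500),
]
for pair, nbytes, share in top_talkers(sample):
    print(pair, nbytes, f"{share:.0%}")
```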
00:24:35
Speaker
Very, very true, because what we have realized is that sometimes when you are looking
Intent-Based Networking and AI's Future
00:24:40
Speaker
at the slowness part, which we talked about, you might be blaming the network, I mean you as in the application guys, but it could be that a lot of TCP handshakes are happening continuously just to send one packet. How do you figure that out?
00:24:58
Speaker
With IP flow, you can at least get more details on what sort of flows are happening for your application and get a view: is this intended, is this what should be happening for the application? And then maybe you are able to figure out whether an application fix or a library fix is required. So that is very true. So there are two discussions we have had so far. One is the metrics part, where we talked about SNMP or streaming telemetry and so on.
00:25:29
Speaker
And the second is the IP flow part, which tells you how your network is being used. Now, coming back to the devices again: earlier we discussed the correlation part, where you want to correlate what is really happening in the network devices. How do you see syslog, which tells anybody and everybody what is really happening within the device,
00:25:53
Speaker
like an interface going up and down, adjacencies going up and down, a BGP neighbor going down, or some other events that are happening? And how is that impacting your overall end user experience, because it is affecting your application delivery in some form or fashion? Maybe traffic was taking the shortest path. Now, because of certain events, it has started taking maybe a longer path,
00:26:17
Speaker
or maybe going through some other AS in the network, right? So how do you see syslog being used as a wealth of information? So syslog, in fact, is one of the most primary things, the first thing, the easiest thing that anybody can tackle. The format is very standardized. It's been there for a very long
Dhruv Dhody's Interest in Internet History
00:26:40
Speaker
time, but it was being used only by debuggers or folks looking afterwards: oh, when I'm debugging, let's check whether the interface went down, did something like this happen? So it was always about looking at records afterwards, and that's how we thought of logging. But we have to change the mindset and use logging more for immediate action, immediate alerting. There are well-known alarms that your device is already generating.
00:27:12
Speaker
How can I use those to triage, find further information and even trigger some actions that will help me find the root cause very quickly, rather than going through the logs two days later to correlate across devices? A lot of good work has started happening there. I've seen products in which you can customize the alerts.
00:27:36
Speaker
Even if the syslog message is at a different level, you can customize when to actually create an alert based on the raw syslog data, which will differ a lot from environment to environment. So the first thing is using syslog as a trigger for further analysis, for further reaction. One very common example: I was mentioning IOAM, and many people use syslog as a trigger for that. The moment a certain log appears, I immediately send a message from a controller to the device saying: on this particular flow, start carrying an IOAM header and start collecting where along the path I'm seeing the problem. Oh, I saw the problem now. Let's dig down further
00:28:19
Speaker
and figure out more information. Later on we may have a self-healing system; that's the holy grail that we are working towards, and we are not there yet. So even if a person or a network manager is looking at it manually, can we give him the right set of information, already collected, so he can look and dig down quickly? Otherwise we used to waste so much time first just collecting raw data manually, opening it in three windows and scrolling through multiple windows trying to figure out
00:28:49
Speaker
where the problem is. So collecting the right set of information, using syslog as a trigger for that, is the easiest thing that we can do in our system, especially as we are also used to logging in the application world.
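A minimal sketch of using syslog as a trigger is shown below. It decodes the severity from the standard `<PRI>` prefix (PRI is facility times 8 plus severity, with lower severity numbers being more urgent) and matches the message against a small rule table. The rules, the sample messages and the action names such as `start_path_trace` are hypothetical placeholders; in a real deployment the action would be whatever the controller supports.

```python
import re

PRI_RE = re.compile(r"^<(\d{1,3})>(.*)$")

# (pattern, least urgent severity that still fires, hypothetical action name)
RULES = [
    (re.compile(r"changed state to down", re.I), 3, "start_path_trace"),
    (re.compile(r"neighbor .* down", re.I), 5, "collect_bgp_state"),
]


def evaluate(line, rules=RULES):
    """Return the action names triggered by one raw syslog line."""
    match = PRI_RE.match(line)
    if not match:
        return []
    severity = int(match.group(1)) % 8  # 0 = emergency ... 7 = debug
    text = match.group(2)
    return [
        action
        for pattern, max_severity, action in rules
        if severity <= max_severity and pattern.search(text)
    ]


# Illustrative message only; PRI 187 decodes to severity 3.
line = "<187>Jan 10 10:00:00 edge-1 %LINK-3-UPDOWN: Interface Gi0/1, changed state to down"
print(evaluate(line))
```

Keeping the severity threshold in the rule rather than in the code is what allows the alerting to be tuned per environment, as described above.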
00:29:03
Speaker
You have such good dynamicity there as well. The idea of correlating the logs coming from the application with the logs coming from the network, and combining them to give better context to the person who is actually debugging, is a game changer. There is lots of good open source and lots of good products already existing in this space, borrowing a lot of things from the application side and now applying them in the syslog world as well. We have to come in with the same sort of programming and server mindset. The things that we are doing for microservices monitoring, we can apply on the network side as well. In fact, like you mentioned, right? We actually have done exactly the same thing,
00:29:51
Speaker
where we are now talking about full-stack observability. We are saying that if we know which network element is used for which application's delivery, we will be able to give you a single pane of glass for that application's health and performance, where we also bring the health metrics, performance metrics, IP flows and logs for all those network elements into the application performance view.
00:30:19
Speaker
Exactly. One of the other things which we have used syslog for, in one of the large public sector banks, is to give a view of what has changed recently in your network. Maybe you are seeing the effect or impact two hours later,
00:30:41
Speaker
or maybe the next day, because that day nobody really realized. Syslog will be able to give you a view: okay, on this particular router or this particular switch somebody actually added or enabled a protocol, or disabled a particular interface. So it gives you a sort of list of all the events that have happened in my network.
00:31:03
Speaker
Then you correlate it with some of your most recent problems. Maybe somebody actually upgraded the IOS or Juniper OS image, and two days later it is causing a problem.
00:31:19
Speaker
So those are some of the use cases which we have seen in customer environments, but I completely agree with you on correlating it with the final application performance. It's something whose time has come, and we have definitely started doing it.
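The "what changed recently" view can be sketched as a simple time-window lookup over change events extracted from syslog. This is a minimal sketch under stated assumptions: the event tuples, device names and the two-day lookback are hypothetical, and extracting change events from raw syslog is assumed to have happened already.

```python
from datetime import datetime, timedelta


def recent_changes(change_events, incident_time, device_ids, lookback=timedelta(days=2)):
    """List change events on the given devices within the lookback window.

    change_events: iterable of (timestamp, device, description) tuples.
    Results are ordered most recent first, so the likeliest suspects lead.
    """
    window_start = incident_time - lookback
    hits = [
        event
        for event in change_events
        if event[1] in device_ids and window_start <= event[0] <= incident_time
    ]
    return sorted(hits, key=lambda event: event[0], reverse=True)


changes = [
    (datetime(2024, 1, 1, 9, 0), "rtr-1", "software image upgraded"),
    (datetime(2024, 1, 2, 22, 0), "sw-3", "interface disabled"),
    (datetime(2023, 12, 20, 9, 0), "rtr-1", "protocol enabled"),
]
incident = datetime(2024, 1, 3, 8, 0)
for when, device, what in recent_changes(changes, incident, {"rtr-1", "sw-3"}):
    print(when.isoformat(), device, what)
```

The device set would typically come from knowing which network elements sit on the affected application's delivery path.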
00:31:36
Speaker
You talked about one very interesting thing in terms of using this as real-time data which can be actioned to do something else. And time and again we have seen that a lot of enterprises have a lot of small scripts
00:31:52
Speaker
or tools they have built internally for their network, where they will run a script and say, okay, I want to shut down a certain interface, or bring up some interface, or change a very specific config on specific interfaces, and so on. So this becomes a sort of automation, which extends to what you said: some event has happened and you get something done. So do you see enterprises starting to use automation to fix network issues themselves? Say some bandwidth needs to be increased in some system, or an interface or NIC is giving some problem, so you reset or restart the NIC, and so on. Have you seen people start using this?
00:32:40
Speaker
Yeah, so most people's journey with network automation, especially at the enterprise level, is usually done in a step-by-step way. I've seen that in the ISP world it's always a big overhaul. It's a whole, we're going to change the whole system, we're going to have a new product, and there's a whole mindset change. Whereas in the enterprise world, it's usually very much step by step, which is a good thing, because for an enterprise the network is just a periphery. It's there to meet some other requirement, and you don't want to overhaul something that's going to change everything about the way you run your business. So doing it step by step does make sense. Start by having a view of which tasks in my network are very repetitive,
00:33:28
Speaker
which are maybe even low impact, so that they are not going to trigger a major change. This will not impact my application, but I have to do them continuously. Things like maybe I have to back up something, or I have to check my image. Even some auditing that I need to do continuously. So any of those repetitive tasks, those are the ones that one can easily target. Many times people have scripts for them, but instead of running the scripts manually, find ways to automate when those scripts should run. And as I was mentioning, syslog or some other trigger in the system itself can tell me that when this condition happens, I should run a particular script, which can be very simple. Then the next step would be, again, looking at which other steps I can automate. Think about network monitoring. We talked about syslog being a trigger for me to do certain actions. Those are the first major things that people should do. Focus more on the network observability and monitoring part first.
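The idea of syslog acting as the trigger for a simple, low-impact script can be sketched as follows. This is an illustration under assumptions: each line is taken to be "hostname, then message", the sample messages are made up, and `backup_config` and `audit_interfaces` are hypothetical placeholders that only describe the action instead of touching a device.

```python
import re


def backup_config(device):
    """Placeholder action: a real version would call your backup script."""
    return f"backup config of {device}"


def audit_interfaces(device):
    """Placeholder action: a real version would run your audit script."""
    return f"audit interfaces on {device}"


# Each trigger pairs a condition on the log message with the script to run.
TRIGGERS = [
    (re.compile(r"configured from", re.IGNORECASE), backup_config),
    (re.compile(r"changed state to down", re.IGNORECASE), audit_interfaces),
]


def handle_syslog_line(line):
    """Return the actions triggered by one log line (assumed format: '<host> <message>')."""
    device = line.split()[0]
    return [action(device) for pattern, action in TRIGGERS if pattern.search(line)]


print(handle_syslog_line("edge-sw-4 Configured from console by admin"))
print(handle_syslog_line("edge-sw-4 Interface Gi0/1, changed state to down"))
print(handle_syslog_line("edge-sw-4 Login succeeded for user admin"))
```

Keeping the trigger table separate from the actions fits the step-by-step approach: the first entries can be read-only tasks such as backups and audits, and anything that changes configuration can be added later once there is confidence in the triggers.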
00:34:33
Speaker
The next step could then be the configuration part. And again, in configuration, bringing up a new device could be a very simple thing: when I'm bringing up a new device, these are the set of steps that I need to run. This is how I manage the security. This is how I'm going to get my keys. All of those tasks can easily be automated.
00:34:53
Speaker
Configuration change can be tricky, and many people say this is where they become uncomfortable. That's completely fine: if you are not changing your configuration too much, you don't even have to automate that part. But especially in cases where you are doing a lot of repetitive tasks which are error-prone when done manually, do them via automation tools and check them via automation tools: you make a change, you verify it, you have health reports, you generate information. Those things can be very easily
00:35:29
Speaker
automated, and people already have good experience of this from doing it with servers. So this whole concept will not be new to an enterprise that is already doing things in cloud and data center. It might be for a brick-and-mortar enterprise with a Wi-Fi setup, or for an ISP who doesn't want to change, who believes: my network setup is done and I'm not touching it. Adding a VLAN? What are you talking about? That's going to be a huge debate. Why are we doing this?
00:35:56
Speaker
We are talking about a person who is anyway in a dynamic environment. For them, it means adding the automation features within the same toolset with which they are managing their servers. They can start doing the network as well, and good open source tools and good dashboards are available already. So it's all about changing the mindset and starting to explore what meets their requirements. Correct.
00:36:23
Speaker
So, one thing is definitely there: it is much easier, much simpler, to do this for, say, a server or a database. Because we know and we see that, okay, this is happening. With the network, there is always a challenge. And when I was talking about this automation part, I was looking at it more from the angle of, okay, I want to reset a specific interface, because maybe it has a large queue build-up, maybe there is a NIC problem,
00:36:51
Speaker
some of those problems which might happen, where people do not have a solution other than a reset. I mean, even if I do not do it by an automation script, I will still log into the switch or the router and just do a down and up of the interface. So for those sorts of problems, again, we are mainly looking at it from how I can make sure that my application is still being seamlessly delivered. So if my application has a challenge, and we identify that maybe a network interface is creating the challenge, at that time maybe somebody should do such automation, when they're very, very sure. Because, like you said, changing a configuration is still a difficult thing on a live router.
00:37:43
Speaker
This is very interesting. I will just switch gears a little bit. I want to talk to you about something which is much closer to your heart, IPv6, which I have been seeing for the last 20 years, and maybe there are people who would say they have been seeing it for the last 30-35 years. And you and IIE SOC have been spearheading this IPv6 implementation. You guys are doing a lot of work in this area. I remember you did a workshop at NITK Suratkal with Moet and team, right? While we see that it is already being adopted by service providers, Jio already provides IPv6 addresses and so on, what are your thoughts on adoption by data centers or large enterprises? Are you seeing people adopting it now, or are they still pushing for IPv4 and continuing to use that?
00:38:41
Speaker
So, I think the cloud is again a game changer here, because in the cloud world, especially in public cloud, people are super comfortable using IPv6. In fact, IPv6 addresses in public cloud sometimes even cost you less. Having global addresses there is easier with IPv6. And sometimes people have even seen performance advantages and some other advantages there. Now,
00:39:10
Speaker
at traditional enterprises, it's a bit of a hard sell. And we know that the things we were talking about earlier with network automation sometimes also apply to IPv6. It's not like their applications will start behaving differently once they move to IPv6. Some people believe it is like moving from 4G to 5G, that they will see some kind of brand new technology. It's not a brand new technology. It is just a different version of IP, which fixed some of the issues that we realized with the earlier version, IPv4. And sometimes what has happened is that the IETF came up with very good translation and transition technologies that continue to satisfy the needs of the enterprise. They may not satisfy the needs of big ISPs, and that's why ISPs are not using them, but when it comes to enterprises and home users,
00:40:01
Speaker
Yeah, they are satisfying their needs. So something has to change. And we are seeing some changes, even at the policy level. For instance, there is the US OMB saying so for anyone who is doing business with the US government, and even recently, the Vietnam government had the vision. And the vision is: why are we still using older technology when new technology is available? So let's use it. Some trigger like that is always good.
00:40:29
Speaker
The other thing is maybe you need more addresses because of IoT, because of the amount of cloud. There are even things like pricing changes. Some external trigger, which may not be a technology one. We would wish there was a technology reason, but at least for enterprises, we have to realize that there isn't a technology reason. Correct. IPv4 can continue to meet their requirements so far. So for them, the trigger would be something layer seven and above.
00:40:57
Speaker
So, something which is outside the networking world. I think we're seeing that those triggers exist. In the US we saw it, in Vietnam we are seeing it, and sometimes we are seeing the same enterprises going into the IoT side and suddenly realizing: since my IoT network is so much IPv6-based, and my public cloud is IPv6-based, why am I not doing it for my infrastructure layer as well? Now, we do know, and we say, that this cannot be done lightly.
00:41:26
Speaker
Moving to IPv6 is possible, but you need to plan it properly. It is not impossible either. And why it's not impossible, we just saw with NITK Suratkal, which you were just mentioning. These were students who led the work on transforming their campus network, which is a pretty big network, multiple departments, thousands of students using it, around COVID time when everything was online, and they were able to move their network bit by bit from IPv4 to IPv6. Now they are in dual stack, and they plan to remain in dual stack. And for enterprises, that's again one thing: there is going to be dual stack. So you do need to realize that, yes, it's going to add a little to your cost because you are doing dual stack, but
00:42:15
Speaker
it's not like you are doubling the cost completely. It's just a little bit more work, a little bit more monitoring. But the tools are already there. If you are a modern enterprise using modern devices and modern software, it's not as if IPv6 is untested or unused. It's been there for so long. It's just a matter of you switching it on.
00:42:35
Speaker
And we saw, for instance, in broadband, in your home network, when Airtel and Jio started using IPv6, did people even notice? Nobody complained. Nobody suddenly went and said, oh, I'm on IPv6.
00:42:51
Speaker
Actually, your applications will feel exactly the same. Everything is IPv6-enabled already; it's just that we haven't switched it on. It's a matter of switching it on and doing testing. Especially if you're using some of your in-house scripts or in-house tools, that is where you need to do more testing. Anything where you are relying on well-established open source or well-established products, those are already well tested and well utilized. I hope we go there step by step, and I'm hoping we will get there. Some enterprises may never, and that's okay. IPv4 is not going to die in that way. IPv4 will continue to work. It will continue to work alongside the v6 part of the network. For instance, Jio can move to v6, part of your network can be v4, and that's completely fine.
00:43:39
Speaker
Right, right. Do you see security as another reason why people can be pushed to use more IPv6? In fact, it's the other way around. People feel they have a NAT, and they are so used to thinking, oh, am I going to expose my IP address to the world? That is the false sense of security that NAT offers, and you have to realize that it's a false sense of security. Can you actually remove your firewall and say that just NAT is enough in your network? No.
00:44:07
Speaker
Then why this idea that, oh, if there is a global address, that's going to suddenly change things? Do you have a firewall? The firewall is there for a reason. And it should monitor anyway: just like you monitor your v4 traffic even though you have a NAT, you need to continue to do that for v6 as well. And it meets all your requirements there anyway.
00:44:29
Speaker
But one encouraging thing I will tell you: there are RFPs which we are seeing now where they expect some of these observability products or platforms to be IPv6-ready as well, because somewhere I think they are putting it in their roadmap
00:44:44
Speaker
that they want to have, if not a full move to IPv6, at least dual stack. Correct. So any observability tool that uses an IPv4 address as a key is wrong. You should not be saying that, oh, my scripts are only going to work when IP addresses are 32-bit. So yeah, I totally agree. That should always be the case: it should work with both IPv4 and IPv6. Very true. Very true. Now, again, jumping to something which is much more currently on everybody's mind.
00:45:14
Speaker
AI and GenAI. I think you touched upon it when you said you wanted something like AI to come in and solve that problem. But do you see AI or GenAI already being used by some of these data centers or service providers, from the network perspective? And is there any specific use case you can talk about?
00:45:38
Speaker
Yeah, so there are two things that are happening with AI in the networking world. In fact, we have a BoF this time which is the other way around: since we are moving towards AI, and since there are such big LLMs, what is the impact on my network?
00:45:54
Speaker
Which is not what you asked, but I wanted to pitch it since there is a BoF this time. And since many people deal with the data center world, this could be something worth looking at. It's called High Performance Wide Area Network, HP-WAN, where they are looking at
00:46:10
Speaker
what changes we need to make in our protocol layers to meet the high-performance requirements that are coming because of these heavy workloads. So that's a totally different direction. Now let's talk about how I can use AI and GenAI to run my network better,
00:46:31
Speaker
or for network management. In fact, in the IRTF there is a Network Management Research Group which is actively looking at this, first as a research problem: what are the set of things that we can actually do? In fact, they are also thinking about this thing called intent-based networking, you must have heard of this as well. The whole mindset is: why do I still need to talk about the network in terms of configurations, rather than giving a very high-level idea of what I want, with something that can break those intents into actual configurations, which need to be further monitored and verified and all those things? A totally different way of doing this. Actually, there are products already available as a starting point for this, especially on the networking side as well.
00:47:21
Speaker
There, AI will play a very important role. Because even for translating an intent into configurations, it cannot be just a fixed set of rules. If it is a fixed set of rules, you are back to where we were. You are just changing the way you do configuration. This needs to be something which has a little bit of extra intelligence, which detects things, which figures out how to break intents into configurations.
00:47:48
Speaker
But looking mostly at the network observability side, the main things which people are looking at are the self-healing part and the predictive part. So monitoring is giving me information as it exists. Network observability is more than that, where I add maybe the correlation part,
00:48:07
Speaker
that I'm not just monitoring one thing, but I'm observing. I'm observing things based on multiple metrics, cross-layer, cross-referencing, looking at logs, looking at multiple devices. I'm observing what is happening. Next would be: using AI, could I predict things? And even before jumping to prediction, I think the first problem we're solving is this: since AI is good at looking at patterns, it is also good at finding when an anomaly is happening. So far, if you're only monitoring for when a log message says something has gone down, that's sometimes too late. Can I start looking at anomalies much before that, something which is not a very clear event in my network, or even a threshold,
00:48:57
Speaker
Because those things we can easily do via monitoring as well, like if my performance is going at this rate. But figuring out: am I going in the wrong direction? And is it to be expected? Because maybe the application load is going high, so of course my network load is going high. But my applications are saying they're not sending any traffic, while my network is going in a totally different direction.
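The cross-check described here, where network load is only suspicious if application load does not explain it, can be sketched with a plain statistical baseline. This is a deliberately simple stand-in for the AI-based detection being discussed, not the method of any tool mentioned in the conversation. The function name, the sample figures, and the z-score limit of 3 are all assumptions for illustration.

```python
from statistics import mean, stdev


def is_unexplained(app_hist, net_hist, app_now, net_now, z_limit=3.0):
    """Flag network load that the current application load does not explain.

    Learns the usual network-to-application load ratio from history and
    checks how far the current ratio deviates from it (z-score).
    Requires at least two historical samples with non-zero application load.
    """
    ratios = [n / a for a, n in zip(app_hist, net_hist) if a > 0]
    baseline = mean(ratios)
    spread = stdev(ratios) or 1e-9  # avoid dividing by zero on flat history
    ratio_now = net_now / max(app_now, 1e-9)
    return abs(ratio_now - baseline) / spread > z_limit


# Illustrative history: network load tracks application load closely.
app_hist = [100, 120, 90, 110, 105]
net_hist = [110, 130, 100, 121, 116]

# Both doubled together: expected, so not an anomaly.
print(is_unexplained(app_hist, net_hist, app_now=200, net_now=220))
# Application is flat but the network tripled: flagged.
print(is_unexplained(app_hist, net_hist, app_now=100, net_now=300))
```

A fixed threshold on network load alone would have flagged the first case and might have missed the second, which is the gap between threshold monitoring and anomaly detection that the speaker is pointing at.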
00:49:17
Speaker
It's an anomaly, and that detection is where AI can play a very significant role. And this work has started already in a new working group in the IETF. They are calling it Network Management Operations, which is looking at all of these things, even bringing in things which are not done in the IETF, like Kafka: looking at how to ingest network telemetry data into Kafka in a much easier way, so that I can feed it to the AI tools that exist in a much better way, where they can detect anomalies and detect patterns much more easily. Moving on from anomalies, the next step would be predictions,
00:49:58
Speaker
that before things go down, I can predict what's going to happen. Predict things also in terms of when I need to increase my capacity or decrease my capacity, based on past patterns of how my network and systems have behaved around a particular time, around a particular event, and be able to add agility in my network based on those predictions.
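As a rough illustration of predicting capacity needs from past patterns, the sketch below fits a straight line to monthly utilization and estimates how long until a link fills up. A linear trend is an assumption chosen for simplicity; real traffic is often seasonal and would need a richer model. The function name and the numbers are hypothetical.

```python
def months_until_full(utilization_mbps, capacity_mbps):
    """Estimate months until a link reaches capacity, using a least-squares line.

    utilization_mbps: one sample per month, oldest first (at least two samples).
    Returns None when usage is flat or falling, so no upgrade is indicated.
    """
    n = len(utilization_mbps)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(utilization_mbps) / n
    slope = (
        sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, utilization_mbps))
        / sum((x - x_mean) ** 2 for x in xs)
    )
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    fitted_now = intercept + slope * (n - 1)  # trend value for the latest month
    return max(0.0, (capacity_mbps - fitted_now) / slope)


# Usage grows by 10 Mbps a month on a 100 Mbps link.
print(months_until_full([40, 50, 60], capacity_mbps=100))
# Flat usage: nothing to predict.
print(months_until_full([50, 50, 50], capacity_mbps=100))
```

The same fit can support the opposite decision: a flat or falling trend well below capacity suggests the link could be downsized.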
00:50:19
Speaker
And then people are thinking of GenAI, especially at the troubleshooting layer, and especially on the side where tickets are being generated, customer service. That's where a lot of GenAI could also be used. And in fact, like you were talking about, can I feed my logs into a model which can give me a pretty picture,
00:50:41
Speaker
and find correlations between these things? These are problems which we see a lot of LLMs already solving in other areas. This is just another set of data. So a lot of different people are looking at this problem at different layers. And it's very interesting. In fact, the IAB is organizing a workshop in December, and as you know, I am one of the IAB members as well. It is called NEMOPS,
00:51:05
Speaker
where they are looking at the next era of network management operations. And we hope to discuss a lot of these things with respect to ingestion: how do we do network telemetry and network monitoring for the next 20 years? What are the set of tools that we are missing? And in fact, looking back, did YANG and NETCONF solve everything? We do realize they didn't.
00:51:27
Speaker
Yeah, it met some of the requirements; it did not meet some of the other requirements. And in fact, in the industry, we know that there is confusion because of the number of YANG models out there. There are IETF models, vendors' YANG models, OpenConfig. So again, it's time for us to maybe look back
00:51:45
Speaker
as well as look forward. And I hope that workshop could be a pretty good way. And folks from enterprises too: sometimes in the IETF we are missing inputs from enterprises. It will be good to have feedback from them as well. And if they wish to participate, they can reach out to me and I can provide them with information on how to participate in this workshop. They can write a position paper or an expression of interest. And it's an online workshop, so it's easier to participate as well. Very nice.
00:52:14
Speaker
So, I think, about the AI use cases you described, in fact, some of it we have done. When you were talking, I was feeling so excited. Yes, yes, yes. You are talking about something of which we did some part. One of the things we did, like you talked about anomalies, right? We actually used something very similar for capacity planning.
00:52:40
Speaker
So, for example, you have a WAN router sitting in one of the branches, and somewhere you start realizing that it is taking more bandwidth. Then can I go ahead and start predicting, or at least giving a view of whether this bandwidth utilization is oversubscribed or undersubscribed against whatever bandwidth is available? So do I need to buy more, maybe in three months' time, given, like you talked about the pattern part,
00:53:13
Speaker
how it has been changing over the last three months? So maybe in the next six months I may have to buy additional bandwidth. Or there are cases where it is completely underutilized and maybe we should reduce the overall bandwidth. So this is one use case which we did. The second one which you talked about was the overall event correlation sort of thing, which we have done, because so many events happen, and in a network everything is interconnected. So let's say a parent device has gone down: anything which is behind that device will start showing as down, though they are really not down. So, so many alerts get generated
00:53:54
Speaker
within that short time till the parent device comes up that we try to correlate some of these things based on the topology as well as on how these events have come in, again looking at both the pattern and the topology. What I really believe is that the network monitoring and observability part has not really been addressed as well as all the other parts. And I find it very interesting that the IETF is actually doing a new working group around this, because this is a very, very interesting area: the network is something which would give me the very first signal, rather than the application telling me that, okay, some problem is there.
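The topology half of this event correlation can be sketched in a few lines: a device reported down is treated as a root cause only if nothing upstream of it is also down. This assumes the topology is a tree given as a child-to-parent map with no loops, and the device names are made up. The pattern-based half of the correlation described above is not covered here.

```python
def root_causes(parent_of, down):
    """Return the devices in `down` that have no failed device upstream.

    parent_of: maps each device to its upstream parent (the root has no entry).
    down: set of devices currently reported as down.
    Assumes the topology is a tree with no cycles.
    """
    roots = set()
    for device in down:
        ancestor = parent_of.get(device)
        # Walk upstream until we hit a failed ancestor or run out of parents.
        while ancestor is not None and ancestor not in down:
            ancestor = parent_of.get(ancestor)
        if ancestor is None:
            roots.add(device)
    return roots


# Illustrative topology: core -> dist-1 -> access-1/access-2, core -> dist-2 -> access-3
parent_of = {
    "dist-1": "core",
    "dist-2": "core",
    "access-1": "dist-1",
    "access-2": "dist-1",
    "access-3": "dist-2",
}
down = {"dist-1", "access-1", "access-2", "access-3"}
print(sorted(root_causes(parent_of, down)))
```

Here four alerts collapse into two: `dist-1` explains the two access switches behind it, while `access-3` is reported separately because its own upstream path is healthy.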
00:54:43
Speaker
Right, so if the network can give that on an almost real-time basis, that would be very, very interesting to look at. I will try joining that workshop which you talked about as well. I think I will be very interested to see what sort of discussions they are having, because I felt SNMP was quite decent for its time, but YANG and NETCONF, whatever they did, as soon as they started letting people do proprietary stuff, it lost the plot.
00:55:09
Speaker
And because of that, again, we are almost in the same situation as with SNMP. So maybe something which is much more standardized needs to be brought in, and we will see how this particular workshop goes and then see if we can participate somehow.
00:55:27
Speaker
Right, I look forward to that. So I am at my last question, one of my favorites, which I generally ask all the people whom we invite. What are your favorite books? And if there are multiple, can you share which book you go to in which situation?
00:55:49
Speaker
So actually I read history the most, so it's not related to the networking world. As far as reading goes, my preference is always more on the history side. All the William Dalrymple books, all Indian history, Roman history, all that stuff is pretty good. But one of the histories which is related is internet history.
00:56:12
Speaker
And one of the very good books on this is Where Wizards Stay Up Late. It's a very good book on the origins of the internet. And since I've been participating in the IETF, and you talk about the Internet Architecture Board and some of the history around that as well: how did we even come up with this idea of, yes, we need to standardize this? This idea of an internet is mind-boggling. And the idea that we could ever replicate something like this is equally mind-boggling. I don't think we can agree on how to do the internet ever again. It was that moment of time, that set of people, that set of push, that it somehow happened. And the whole history is amazing to read about. So that would be my suggestion.
00:56:58
Speaker
So, this is something people keep saying, right, that there are certain events which change the whole thing that people will be getting into. I think the internet was definitely one such thing, which somehow happened and flourished like anything, and people are using it day in, day out without even thinking that this is the internet,
00:57:26
Speaker
or that this is a network of networks. So that is very interesting. Cool. Thank you so much, Dhruv, for your time. As always, I love talking to you, and I am sure we could keep going and keep discussing different things. But we need to close it here for today. Thank you so much for your time and insights. Really appreciate it. Yeah. Thanks, Bharat. Thanks, everyone. Bye. Hope you found my discussion with Dhruv insightful. If you did, please consider sharing it with your colleagues. You can learn more about Dhruv at
00:58:01
Speaker
www.thruthdodi.com. For more information about Vue Red Systems, please visit us at www.vueredsystems.com. Thank you so much.