
024 - Is AI coming for your job?

S1 E24 · Stacked Data Podcast
178 Plays · 10 months ago

Is AI coming for data jobs?

Gaurav Tiwari, an Engineering Manager at Spotify, joins me on The Stacked Data Podcast. I first encountered Gaurav's insightful perspectives on AI at The London Analytics Engineering Meet-up and had to get him on the show!

With a deep-seated passion for AI, Gaurav brings a critical eye to this rapidly evolving field. We dived deep into the implications of Generative AI on the data landscape and how it will impact your roles and responsibilities in data...

Gaurav shares his thoughts on how AI will impact the life of a data professional…

In this episode, we cover:

✅ Understanding Generative AI

✅ GenAI at the Consumption Layer: Discussing how GenAI is reshaping the interaction between businesses and data through analytics.

✅ Benefits of GenAI in Analytics: Exploring the potential benefits GenAI could bring to self-serve analytics platforms.

✅ Challenges and Solutions: Identifying the biggest challenges when integrating GenAI into analytics processes and how to address them effectively.

✅ Strategic Investment and Pitfalls: Guidance for organisations on where to start their investments in GenAI and potential pitfalls to avoid.

✅ Data Engineering AI Impact:

✅ Challenges Specific to Data Engineering: Examining the unique challenges data engineers face when integrating GenAI and how to overcome them.

✅ Optimal Strategies for Implementation: Recommended strategies for adopting GenAI within data engineering teams and aligning methodologies with new tools.

✅ Tools and Technologies: Highlighting specific tools or technologies such as Infer, TurinTech AI and more

Gaurav's insights and expertise make this episode a must-listen for anyone interested in the evolving landscape of Generative AI and its impact on data analytics and engineering.

Transcript

Introduction to 'Stacked' Podcast

00:00:02
Speaker
Hello and welcome to the Stacked podcast brought to you by Cognify, the recruitment partner for modern data teams hosted by me, Harry Golop. Stacked with incredible content from the most influential and successful data teams, interviewing industry experts who share their invaluable journeys, groundbreaking projects, and most importantly, their key learnings. So get ready to join us as we uncover the dynamic world of modern data.

Guest Introduction: Gaurav Tiwari

00:00:34
Speaker
Welcome to today's episode. I'm thrilled to have Gaurav Tiwari, an engineering manager at Spotify, joining us. I first encountered Gaurav's insightful perspective on AI at the London Analytics Engineering Meetup, an event we help organize here at Cognify.

Gaurav's Passion for AI

00:00:51
Speaker
With a deep-seated passion for AI, Gaurav brings a critical eye to this rapidly evolving field. Today, we will dive deep into the implications of generative AI on the data landscape. Gaurav, it's great to have you here. Can you first share with the audience a bit about your career journey up to now and, I suppose, what ignited your passion for the AI space?

Gaurav's Career Journey

00:01:14
Speaker
Yeah, thank you very much for that introduction, and I'm happy to be here. So I have been in the data space, generally speaking, for about 12 years now. I started off working in Singapore and in Malaysia, working as an
00:01:30
Speaker
analyst at that time, which was very much focused on pricing algorithms and so on. This was very much what we call in today's world a data scientist. It's just that at that time "data scientist" was not yet a common term. This was my first job right out of my master's in information systems, which was a very research-based course, working with PhD students and so on. After that role, I was hooked on the idea that data can be used for a lot of analysis, a lot of reactive and predictive analysis, right? You can get a sense of what's happening, but at the same time, you can get a sense of what can happen based on what you have done.
00:02:09
Speaker
I've had a lot of back and forth between data science and analytics. And that's where I felt like this is an area I feel very passionate about, and I want to see how we can use data to its advantage. It kind of coincided with the time when the digital evolution was at its very initial stages, but at the same time very promising. So everyone had started collecting data, or to some extent had some cloud-based data, right? And that's where it was very easy to access that information. That definitely inspired me to build a career and move on towards

Roles at Facebook and Spotify

00:02:42
Speaker
the field. As I progressed, I worked in different companies across Europe. In my last two roles I was leading data teams at Facebook, at Meta, in Dublin, and then for the last two and a half years I have been working as an engineering manager at Spotify, which is a bit more of a software engineering role, but at the same time very driven by data.
00:03:02
Speaker
And those are the things which make me believe that more and more organizations are going to be data-driven. It sounds like a cliché at this point in time, because everyone is trying to be. But data sets are at the center of all that's happening. Therefore, I feel that this is a very interesting area to work on, to build a career in, and at the same time to contribute to this community and share what I have learned so far. And when it comes to AI, I think everyone is talking about it. That's for certain. Everyone wants a piece of it. But very few people actually know what's happening,

Skepticism About Generative AI

00:03:40
Speaker
right? And I have seen folks who have close to zero understanding of generative AI, and there's nothing wrong with that, but who are at the same time trying to understand how they can fix everything up
00:03:51
Speaker
based on some video that they saw somewhere, or based on the fact that they can prompt ChatGPT, which is good. It's nice, but it's not concrete enough. It's not relevant enough to say that this is all that AI can do. So I felt that this is a pretty good place for me to use my expertise and the things that I've done in the past to create a bit more awareness, specifically of GenAI in the data community, because a lot has been said about how GenAI has revolutionized the way you can create content, marketing assets, and so on, right? You can create images, videos, text, and so on.

Potential of Generative AI

00:04:28
Speaker
But when it comes to the advantage of Gen AI in tech, that's an area that needs a little bit more focus.
00:04:36
Speaker
and within tech, also data. That's where I have spent a lot of time trying to follow what's happening and how I can, based on my influence or my understanding of AI, bring in insights for people. So that's where my inclination lies. I've been speaking over the last few months about how GenAI can play a big role in the way data analysts work, in the way that analytics engineers or data engineers work, right? And that's something that, yeah, I derive a lot of energy from, and I find it very motivating.

That's great to hear your journey. It sounds like data has always been a key passion of yours. And you obviously mentioned there that AI is forcing organizations to become data-driven. And I think everyone's been chasing this data-driven approach.
00:05:23
Speaker
Maybe it's more the data teams driving that, but now what I've seen with our clients is they're finally getting significant buy-in from senior leadership, from executives, even though they might be oblivious to what they're actually talking about.

AI Literacy in Organizations

00:05:37
Speaker
They want to be "AI literate", is a term I've heard, and start utilizing data more effectively, which I think overall is positive for the data community, because now there's buy-in from the top, which hopefully unlocks the investment needed. But we need the relevant data teams in place to effectively utilize it. So yeah, interesting to hear your point on that as well.

Understanding Generative AI

00:06:01
Speaker
As you've said, you know, not everyone understands necessarily what
00:06:06
Speaker
generative AI is. We've all seen it hit the scene with a boom with ChatGPT, but AI has been around for some time. So I suppose, could you set the stage by explaining what generative AI actually is?

Yeah, I think that could be a podcast in itself, what generative AI is, but I'll try to summarize based on what I understand and my working knowledge of generative AI. I feel like generative AI is a part of the overall artificial intelligence ecosystem. So there is this notion in artificial intelligence where you would have
00:06:46
Speaker
something called general intelligence, a general AI, where basically a model, a computer, or a machine can learn from what you do, take decisions, do things that it hasn't learned before as well, and adapt to it. That's the basic notion when you look at people showing AI and they show this robot or sentient kind of being, right? But GenAI is a very small subset of that whole philosophy. GenAI is basically underlying models, or as they call them, machine learning models, which essentially
00:07:22
Speaker
learn from the context of the text or the information that you have provided to them, right? So when you provide information, the model learns, basically from billions of parameters or sources, transforms or understands what's happening, and creates an output which is based on what it has learned from. So the term generative means it's generating something, be it text, an image, music, or even code at this point in time for engineers, which is based on the information that it has learned from, right? So there is a notion of how Gen AI is different from what we have done in the past, because a lot of things that we have done have been,
00:08:00
Speaker
if you want to call it traditional AI to some extent, very much focused on how we can analyze a particular dataset from the past and use it with certain formulas and algorithms to predict what's going to happen in the future, right? And this is very restrictive, or very deterministic to some extent, because the answers or responses that you will get from traditional AI models are very restricted to the past data and the way you're asking questions. The key difference where generative AI is taking the stage, and why everyone is talking about it, is that it has the ability to understand things, give an output that is very relevant to what you're looking at, and transform it in the way that you're talking about, right? So essentially it's a large model,
00:08:46
Speaker
a machine learning model. You can call it a language model, which will give out text, or an image model, and so on, that has learned from billions of parameters of text, images, and so on, and can therefore give you an answer to the questions that you have, right? So that's the way I understand generative AI, and it's key to differentiate generative AI from general AI, because a lot of people assume that this is general AI, and that's not the case. So for anyone wondering, that's a key difference to keep in mind.

Yeah, I think later we're going to dive into some other areas of AI as well. But as you said, your expertise and where you've tended to focus your knowledge is around how AI is going to impact data. And I know that when AI was
00:09:35
Speaker
first out, when generative AI first came out, everyone was scared, you know, they were going to lose their jobs, right? But I think as time has passed, it's become clear that we're still quite a way away from that.

AI in Data Consumption Layer

00:09:46
Speaker
And I think today, I'd love to dive in with you on how the data community and our listeners can really enable themselves, their organizations, and their data teams with AI. I think it can be used across the whole data lifecycle, and maybe a good place to start will be the consumption layer of data. So yeah, let's understand
00:10:09
Speaker
more about how AI can have practical applications at this consumption layer of data, where the business directly interacts with the data that we see. How do you see GenAI reshaping this interaction?

I think that's a very good point, to look into how consumption is happening and how GenAI can basically enable, and I think you said the right word, enable our current teams to do more, as opposed to replacing the role altogether, right? I think that when we met at the Meetup, I...
00:10:43
Speaker
I had a very clickbaity title about will Gen AI replace data engineers, right? So I think one of the things that should bring peace of mind to individuals is that, in my opinion, there are a lot of applications for Gen AI in the data industry, but it's still not at the point where we can say that it can easily replace you, right? So from a consumption layer point of view, data is only as good as what you can make out of it, and trying to find a way in which people can have more access so that they can do more analysis is the part that everyone is trying to do, right? So we have this
00:11:20
Speaker
notion of self-serve analytics, where everyone has the ability to derive insights and data patterns very easily, without having to learn a tool and so on, right? I think this is where GenAI has a very big impact and a big opportunity to bridge that gap. Because we talked in the past about how self-serve is helping to lower the barrier for more non-technical folks or business folks to get an understanding of what is happening in the business. And that's basically the place we call the consumption layer, right? How do businesses interact with the data sets that they have? Not everyone would like to be technical enough to write all these
00:12:06
Speaker
technical queries, code, whatever it is, right? And therefore they interact with data analysts or dashboards and so on, and that consumption layer is very static at the current moment, right? If you have built a particular dashboard, it's going to stay there, and that's basically it. That's where I feel there is a good place for GenAI to come in and interact with the businesses, the non-technical audience, or the business leaders, helping them understand their data, right? So how you can make this consumption layer a bit more dynamic is where I feel that GenAI has a very practical application. It sounds theoretical, but if you look into the practical side of things, there are orgs, companies, or players in the field which are already bridging that

AI for Non-Technical Users

00:12:48
Speaker
gap using GenAI. One of the big things
00:12:50
Speaker
that, speaking very much from a large language model perspective, Gen AI is able to solve is the very long-standing problem of text-to-SQL, right? How can you convert text into SQL, because your business audience does not know how to write SQL, but they have business questions that can be answered using SQL? That's the role of a data analyst, right, most of the time. But imagine you have 300 questions and there's one data analyst to answer all these different requests. Having a model that can translate that business request from a natural language or text form into SQL is where the consumption layer basically stands. That's where the consumption is, right? And that's where GenAI can drastically improve those interactions.
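
As an illustration of the text-to-SQL idea described above, here is a minimal Python sketch. It is not taken from the episode: `translate_to_sql` is a hypothetical stand-in for whatever model does the translation (here it is just a lookup table), the `orders` table is invented demo data, and the read-only check is a deliberately crude assumption about how generated SQL might be vetted before it runs.

```python
import sqlite3


def translate_to_sql(question: str) -> str:
    """Hypothetical stand-in for a text-to-SQL model call (assumption)."""
    canned = {
        "how many orders per country?": (
            "SELECT country, COUNT(*) AS orders "
            "FROM orders GROUP BY country ORDER BY country"
        ),
    }
    key = question.strip().lower()
    if key not in canned:
        # Refuse rather than guess when no translation is available.
        raise ValueError(f"no translation available for: {question!r}")
    return canned[key]


def is_read_only(sql: str) -> bool:
    """Crude guard: accept a single SELECT statement and nothing else."""
    stripped = sql.strip().rstrip(";")
    return stripped.lower().startswith("select") and ";" not in stripped


def answer(conn: sqlite3.Connection, question: str) -> list:
    """Translate a business question, vet the SQL, then run it."""
    sql = translate_to_sql(question)
    if not is_read_only(sql):
        raise ValueError("refusing to run non-SELECT SQL")
    return conn.execute(sql).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory table of invented orders for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, country TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (country, amount) VALUES (?, ?)",
        [("UK", 10.0), ("UK", 25.0), ("SE", 40.0)],
    )
    return conn


if __name__ == "__main__":
    print(answer(demo_connection(), "How many orders per country?"))
```

In a real setup the lookup table would be replaced by a model call, and the guard would likely need to be much stricter than a prefix check; the point here is only the shape of the flow from question to vetted query to result.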
00:13:36
Speaker
That's where I feel the reshaping is happening. How do you bridge that gap so that people who don't know, or don't want to or don't have to learn, the technicalities of things can still be enriched with the data that they need to drive their business value? For example, if you're working on the sales side of the business, I would argue that you don't need to learn SQL, and you don't need to learn how to navigate through a complex semantic layer of a dashboard. Because why would you want to? Your focus and your area of work is very different. I would rather have
00:14:16
Speaker
If I was the CEO of that company, I would rather have this person in sales be more focused on making bigger sales as opposed to learning data, right, like learning how to work the dashboard. So that's the gap in the consumption layer that I think can be filled by Gen AI. And that's a very, very interesting area that we need to look into as we go into the future.

Yeah, I think that ties in with what you mentioned there. I believe it's the episode that's going to be coming out after this one, with David Jay, who's the VP of AI at Cube, a semantic layer company, and they talk about how self-serve users can interact through natural language questions with a semantic layer. I think that's where the real power of self-serve can come from, right? The business needs to ask their questions, but we need to contain them without going full-fledged onto the data warehouse and upskilling. So how does this translate, I suppose, to self-serve analytics, but also to analysts? What benefits are they going to get from Gen AI, and how can they improve their ways of working?
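
To make the idea of containing business questions within a semantic layer a little more concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not Cube's API or anything described in the episode: the metric and dimension names, the `orders` table, and `compile_query` are all hypothetical. The point is only that queries are built from a vetted vocabulary instead of being written freely against the warehouse.

```python
# Hypothetical semantic layer: business terms mapped to vetted SQL fragments.
METRICS = {
    "revenue": "SUM(amount)",
    "order_count": "COUNT(*)",
}
DIMENSIONS = {
    "country": "country",
    "channel": "sales_channel",
}
TABLE = "orders"  # invented table name for the sketch


def compile_query(metric: str, dimension: str) -> str:
    """Build SQL only from known metrics and dimensions; refuse anything else."""
    if metric not in METRICS:
        raise KeyError(f"unknown metric: {metric!r}")
    if dimension not in DIMENSIONS:
        raise KeyError(f"unknown dimension: {dimension!r}")
    dim_sql = DIMENSIONS[dimension]
    return (
        f"SELECT {dim_sql} AS {dimension}, {METRICS[metric]} AS {metric} "
        f"FROM {TABLE} GROUP BY {dim_sql} ORDER BY {dim_sql}"
    )


if __name__ == "__main__":
    print(compile_query("revenue", "country"))
```

Under this design a model (or a person) only has to pick a metric and a dimension by name, and a request for something the layer does not define fails loudly instead of producing improvised SQL.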
00:15:27
Speaker
Yeah, I think there is a massive benefit for data folks, generally speaking. Having worked in data teams for about the last 10 years, I can see there is a significant shift in the way of working that data analysts go through on a day-to-day basis. I gave a presentation some time back where I mentioned how, throughout the last 20 or 30 years, the field of data has evolved in different waves, right? We had the previous generation, where people were creating data warehouses and data marts and then creating cubes and so on. Then came this interactive thing, right, another wave where we had everyone adding things into Tableau or Looker (LookML) kind of dashboards, and now the focus was more on creating those consumption layers
00:16:14
Speaker
as opposed to finding things. Then there was a focus on creating more self-serve, interactive dashboards where people can do things by themselves. And if you look at all these cases, what is happening is that the amount of time a data analyst spends solving ad hoc or basic problems is decreasing, right? If you make a dashboard interactive, it answers a lot of questions like, hey, what about this country? What if that happens? What if this happens? You can embed those functionalities in an existing dashboard. And that makes the life of a data analyst much easier, because they don't have to create five different dashboards or analyses for each of the different questions that you have; you can wrap it all up into one dashboard, right?
00:16:59
Speaker
And what this gives you, essentially, is time back, right? So now, if you look at a data analyst in 2024 versus a data analyst in 2004, I'm quite certain the speed at which a data analyst is working now, or the variety of tasks that a data analyst is doing now, is significantly higher than for someone in 2004. So it's all part of the evolution. And speaking of GenAI, what's going to help you is asking how you can automate these things. How can you reduce the number of trivial or ad hoc requests that you are handling? That's one way to look at it, which can free up more time for you. Because at the end of the day, a data analyst
00:17:43
Speaker
is very much focused on driving insights that can help the business grow in their trajectory, right? And ad hocs are one way to do it, but there are more interesting things that you

Streamlining Data Analysis

00:17:54
Speaker
can do. And there are studies which mention that an analyst generally spends over 50% of their time just doing ad hoc requests, right? They're frustrating at times, because the context switching is quite tricky from a data analyst's ways-of-working point of view. At the same time, they drive very little value, because once you fulfil an ad hoc request, the result will probably never be used again, and you did spend a significant amount of time as an analyst creating it, right? So that's where I feel that there is a massive benefit for
00:18:23
Speaker
data analysts to leverage Gen AI in this space and think about a way in which you have an interface, or a consumption layer as we mentioned, which can automatically answer questions that are trivial for you. And as a data analyst you also do not need to onboard your audience to a tool that they haven't used before. That has happened in the past, right? When you have to use these interactive and self-serve tools like Looker, Tableau, and so on, even though they are very easy drag-and-drop tools, they still require a degree of technicality, and you need to be onboarded to them, right? I have seen
00:19:01
Speaker
business folks, marketing folks, sales folks saying that they have attained a Tableau certification so that they can now navigate through Tableau, right? And my hypothesis is, if you need a certification to use a self-serve tool, then it's probably not self-serve straight away, right? You still need technical nuances to figure the tool out. What comes with that is that, as a data analyst, you need to constantly provide that onboarding and that support for new members to use the tool properly. Those are the things that can be solved very easily with GenAI. So imagine not having to be constantly disrupted by these kinds of requests, right? That's a huge benefit. And that's where I think data organizations, or analytics organizations specifically, can start leveraging GenAI in their businesses.
00:19:49
Speaker
So if you're an analyst, or if you're a head of data or a leader, you should be looking at: what are these trivial tasks I'm doing? What am I repeating week on week, month on month? And then looking at how you can replace and automate these tasks in the right way with the right technology. And as you say, it's then making sure that the business is also on board with that. Because I think at this stage, there's always going to be some sort of upskilling of the stakeholders.

And yeah, I would say it's not a new exercise, right? As a data lead, like when I was a data analytics manager, or in my current role as well, I constantly try to find ways in which we can reduce duplicates and optimize the way of working. I'm saying that Gen AI could be a very good tool in your arsenal to do it at a much bigger scale than you have done before.
00:20:39
Speaker
10 years back, when self-serve analytics or Tableau was still in its early stages, maybe 15 years back, you were still looking at ways to automate things. And we realized, oh, dashboarding could help, and it can automate the questions, right? That was one tool in your arsenal. In exactly that way, GenAI is a much more pronounced, much more important and effective tool in your arsenal that data leaders can utilize at this current moment.

So that's a nice segue. You've said that it's a tool in your box, but what tools are out there? What investments can a team make? We're talking about how they can use this, but can you give a practical example of a way that this could be implemented?

Selecting AI Tools

00:21:21
Speaker
Yeah, that's a good one, because as I started out by saying, everyone is talking about Gen AI, but not many people know what exactly they need to do. And that's where the rubber hits the road, right? From a data lead perspective, let's take the same example of introducing Gen AI in the form of a text-to-SQL setup, where the organization can ask questions in natural language and your model answers the question, right? This is a problem space that is being worked on very heavily right now. You mentioned David, who was part of Delphi, and now he's working at Cube. Delphi is one of the big companies that was trying to solve this problem, right, where you can use the idea of a semantic layer and expose that semantic layer. And we can get into that as well. But then there are other companies which are pioneering this. For example, Seek.ai is another company which is trying to do that at an enterprise level. And then Zenlytics is another company which is trying to solve this problem, right? So the idea here is that they
00:22:19
Speaker
are providing an integration, or sort of a plug-in so to say, a tool which you can integrate with your current data sets, and that can answer these kinds of questions for you, right? However, it's not as simple as plug and play. I've been speaking to quite a few non-technical folks as well as technical folks, and the general sentiment is, oh, can we just plug AI into our strategy and it will work? Not really. It doesn't work that way. If you think that you just want an integration of ChatGPT with your company and that's going to solve the problem, it's probably not going to, right? There are challenges associated with that. And I would say there are a couple of things that, as a data leader, you would want to look into. Number one, think about the problem that you're trying to solve and find the tool accordingly, rather than first knowing, oh, we can leverage this tool, now let's try to find a problem that can be solved by this tool, right? That's the wrong approach right now, because
00:23:18
Speaker
if you have fixated on a particular tool and you're trying to build a problem around that tool, well, in this space things are changing so fast that by next month this tool will be outdated and you will have a new tool already, right? So try to find out the problem that you're trying to solve first. If the problem is creating a way to reduce the ad hoc requests that we have, then let's look at what the different tools out there are, and so on, right? And even if you do that, the other challenge, which is more on the technical side of things, is: is your infrastructure ready to the point where it can support AI, right? That's where the concept of Cube, and what we talked about, the notion of semantic layers, comes in. Because if you are going to point your entire data warehouse or database at an AI model,
00:24:08
Speaker
it will work to some extent, but it's never going to be reliable for you. So you need to figure out a way in which you can have the infrastructure that supports an AI model as well, right? That's where the idea of a semantic layer kicks in: you need to increase the relevancy of the question and the context that you're providing to your AI model, right? If you don't do that, there is a risk that your data quality will suffer. And at the end of the day in data, and I was actually listening to your podcast with David the other day as well, and he mentioned it, everything relies on trust, right? So we need to be able to build trust in the things that we're doing as a data team. If our stakeholders do not understand or trust the data that we are providing them, it doesn't matter how fancy or how shiny the tool is at the end of the day, right? So finding a tool that can be trustworthy and gives the
00:25:02
Speaker
reliability indication is very important. That's another challenge that you want to solve before you jump into this. And that's, I would say, a pitfall that you can fall into if you think that you want to just move fast and plug it in and it's going to solve things. That's not going to happen.

I mean, that's what we spoke about with David, this generative AI versus synth AI. Generative AI will generate you an answer. It is programmed to give you a response. That's why we get these solutions and it can give
00:25:32
Speaker
[inaudible] And then once you lose that trust, and we always talk about how big it is, it's so hard to gain from the business, but it's so easy to lose. So you'd rather your stakeholder ask a question and your AI say "computer says no" than have it generate something that's made up, which is then going to have an impact on the business. So really interesting to hear. We've now covered how you need to have the right tooling and the right infrastructure in place to enable your analytics, your insights, and your self-serve users. So I think that's a nice segue to, I suppose, the other area that
00:26:14
Speaker
generative AI can impact, and that's data engineering, the infrastructure behind

AI in Data Engineering

00:26:21
Speaker
it. So, focusing now on moving back in that data lifecycle, how do you perceive the role of generative AI in this domain, as it's obviously a vastly different realm with different challenges?

Yeah, I think Gen AI in data engineering is a hard nut to crack, in my opinion, right? Because there are a lot of advancements now, as I mentioned, in technology generally speaking, right, around the usage of Gen AI. I mean, we talked about, uh,
00:26:52
Speaker
Devin, which is one of the Gen AI based coding assistants, or a software engineer, so to say, right? That's basically going to help you do software engineering on its own. And I think from a data engineering point of view, there are a lot of striking similarities, and ways to get more inspired by the work that is being done in software engineering right now. So it's hard, but at the same time, it's very interesting what kind of things you can do, right? And this is where, as I mentioned earlier, I gave a talk around "will Gen AI replace data engineers", because the nuance of the work that data engineers do is still quite tricky, in my opinion. It's not one size fits all, right? But at the same time,
00:27:39
Speaker
the summary of that talk, in my opinion, was that no, it will not replace data engineers, just for anyone who was wondering whether it will or not. But I still think that there is a good use case where you can use generative AI as an assistive intelligence for data engineers, as opposed to just saying it will do everything that you want to do, right? So there are a lot of use cases, in my opinion. If you look at it as a data engineering leader or head of data engineering, for example, and you look at the things that cause pain or block your data engineering team on a day-to-day basis, that's where you can utilize AI to improve the way of working, right? So
00:28:25
Speaker
I gave some examples that I thought would be really helpful. For example, the majority of data engineers, when they start working on a new pipeline, have to go through a set of code that they need to create and run before they can actually run things. This is called boilerplate code, right? This takes, I wouldn't call it a significant or massive amount of time, but a good amount of time for data engineers to get to a position where this is generated and you can get started on actually building a pipeline, right? So imagine having a tool that can do automated boilerplate code generation for you, right? And then you can work on top of that. That's the assistive part of AI, and I think that is where it's good, right? So there are more examples that I can go into, but generally speaking,
00:29:10
Speaker
data engineers stand to benefit from generative AI, if anything. There is a good chance that you would have more efficiency in terms of the work that you do. And more importantly, something that sets a data engineer apart in any organization is the fact that they have more business context about the things that they work on, right? And it sounds odd to say, but data engineers often do not want to spend more time with the business, and so on and so forth. They have this very nice place where they kind of
00:29:41
Speaker
get the data, send it to the business, and basically just sit back, right? And basically let the data analyst figure it out. But I feel like that's not the complete picture, because some of the best data engineers I know have a very keen and very good understanding of the business. And that only comes when you have more time to understand the implications of the work that you have done. So if you want to be a really good data engineer, or want to have a very successful data engineering team, you need to be very embedded in the way that your organization or your business works, right? But life is tough, and data engineers won't have enough time to understand the business. The only way you can do that is to create more time to understand what the business does. And that's where Gen AI comes in, in my opinion. You can't literally create time by magic, but you can use Gen AI to kind of create that time.
00:30:31
Speaker
Yeah, I couldn't agree more with data engineers becoming closer and closer to the business. Again, in another episode that's not actually recorded yet, we're going to be diving into how data teams need to communicate better internally and have more empathy for what is going on either side of the walls. And, you know, we always talk about these silos. We need to break them down, even within teams and communication channels, and be that one fluid, well-oiled machine that's communicating well, because then you can process things much faster and set things up better, which is going to enable the person downstream of you to do their job more effectively. Yeah. And I've been recently working, so I also work from time to time as a consultant and a volunteer with different data orgs. And
00:31:21
Speaker
there was a recent case where I saw that someone, a data engineer, basically created a pipeline to ingest or load data into BigQuery data warehouses, which basically is raw data, but each cell contains a nested JSON, with six or eight different levels of nesting. In theory, the data is there. You can work with it, but can the business work with it? Imagine having a data analyst write these 16 levels of unnesting to be able to get a column value out of there. That's not really nice, because
00:31:58
Speaker
in theory, it makes sense that the data is there, but if the data engineer had a bit more understanding of how the business is actually going to use this metric or this data, it would make for a much more seamless and smoother experience for the entire team to work with it. So this is an example where I feel there is a need for data engineers to be a lot more involved in understanding how the business uses data and how the business interacts with data, generally speaking, Gen AI or not. And that's where I think there is an improvement area for data engineering teams as well. So the usage of data is also important, not just getting data from one place to another, right?
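To make the nested JSON point concrete, here is a minimal sketch of flattening a deeply nested record into plain columns before it reaches analysts. It is an illustration only, not the pipeline described in the episode: the `flatten` helper, the `raw_row` sample, and its field names are all hypothetical, and lists are left untouched for simplicity.

```python
from typing import Any


def flatten(record: dict[str, Any], parent: str = "", sep: str = "_") -> dict[str, Any]:
    """Recursively flatten nested dicts into a single level of column names."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        # Build the column name from the path of keys, e.g. order_customer_country.
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            # Scalars (and lists, in this simplified sketch) are kept as-is.
            flat[name] = value
    return flat


# Hypothetical cell content, standing in for a nested JSON payload.
raw_row = {"order": {"id": 42, "customer": {"country": "UK", "plan": {"tier": "premium"}}}}
flat_row = flatten(raw_row)
```

With columns like `order_customer_plan_tier` available directly, an analyst no longer has to write the unnesting logic for every query.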
00:32:36
Speaker
Yeah, and that context of what the data is being used for is so powerful.

AI for Data Quality Checks

00:32:40
Speaker
It depends on how you're going to present it, structure it, and organize it. So we've obviously touched upon some of the areas. It seems that generative AI is, again, an enabler of data engineers. How can a data engineer effectively utilize that? Because again, it's easy to say it can enable you to be better, but we've already spoken about hallucinations and other problems with it. So how can data engineers genuinely incorporate this into their workflows?
00:33:09
Speaker
Yeah. I think that's a good question. And I wish I could tell you five different things that people are doing already, but that's not true right now. There are people who are trying to figure out what the best way is, but I can tell you areas where investments are being made or people are trying to understand how they can use Gen AI, right? And one of the areas that I would say is very important is data quality. So you can leverage Gen AI models to basically perform data quality checks for you. A lot of times, data engineers will build monitoring and observability dashboards to understand what's happening, which is fine, but this is very limited to the understanding of the data that you have so far. There might be new things that come up that you don't really know about. And by the time you find this out in a dashboard, it's already a bit late to say that we need to make changes, we need to refactor code, right?
00:34:04
Speaker
So that's one area where I see that you as a data engineer can leverage AI to essentially do data quality checks for you, understand the pattern of the data, and basically see what stands out. If there is a new parameter or new code, or you integrated a new data set, and it essentially changes your revenue or other calculations, or the number of... well, that's a very simple check, but generally speaking those will change. You would have a better understanding of your data quality as opposed to just doing it yourself. That's one place where I can see that there is an important area.
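As a rough illustration of "see what stands out", the sketch below flags a metric value that sits far from its recent history. It is a plain statistical stand-in, not a Gen AI model and not a method named in the episode; the `is_anomalous` function, the threshold of three standard deviations, and the `daily_revenue` figures are assumptions made for the example.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Return True if `latest` is more than `threshold` sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # Not enough data to estimate the spread.
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly constant history: any different value stands out.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical daily revenue figures before a new data set was integrated.
daily_revenue = [100.0, 102.0, 98.0, 101.0, 99.0]
spike_flagged = is_anomalous(daily_revenue, 150.0)
normal_flagged = is_anomalous(daily_revenue, 101.0)
```

A check like this only catches shifts in values it is pointed at; the appeal of a learned model, as described above, is catching patterns nobody thought to put on a dashboard.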
00:34:42
Speaker
Another thing that I see as very important, and which you can leverage, is data lineage, an area where there have been a lot of advancements now. So if you go to dbt or something, for example, they provide this really nice lineage graph for you, but that's not the case for a lot of data engineering pipelines, so to say, right? So if you are new to an organization, or if you're trying to figure out a bug that you didn't know about, not having that lineage is going to be a painful process. And you can leverage AI to basically understand the code that you have so far and build that lineage for you, right? And once you have that lineage, you can also leverage AI to optimize it for you as well, right? So those are the places where I feel like it's very, very important. And last but not least, and there is more development that needs to be done in this area, in my opinion, is documentation.
00:35:34
Speaker
No one, data folks or otherwise, likes documentation. That's going to be the reality of the case. And by the time you finish it, it's outdated already. So finding a way in which you can leverage AI to basically do documentation is also a very good and very nice approach. So once again, going back to analytics engineering, dbt has, and Cube for certain as well has, a very nice documentation portal and things that you can do. But it's static, right? It's limited to the documentation that you have written. So how you can use AI in that space to basically build that documentation as well is also very important. Scan your entire datasets, understand what the definitions are, the business definitions, and basically start creating documents for everyone. That's going to be an area where I feel like data engineering can do better.
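One way to picture documentation that does not go stale is to regenerate it from the schema every time the pipeline runs. The sketch below only renders a stub from column names and types; writing the business definitions is the part an AI model or a human would still need to fill in. The `render_table_doc` function and the `orders` schema are hypothetical and are not taken from dbt, Cube, or any tool mentioned here.

```python
def render_table_doc(table: str, columns: dict[str, str]) -> str:
    """Build a plain-text documentation stub for a table from its column types."""
    lines = [f"Table: {table}", "Columns:"]
    for name, dtype in columns.items():
        # The description is left as a placeholder for a model or a person to complete.
        lines.append(f"  - {name} ({dtype}): TODO describe business meaning")
    return "\n".join(lines)


# Hypothetical schema, as it might be read from a warehouse's metadata.
doc = render_table_doc("orders", {"order_id": "INT64", "revenue": "FLOAT64"})
```

Because the stub is derived from the schema rather than written by hand, a new or renamed column shows up in the document on the next run.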
00:36:18
Speaker
I think that one's probably one of the easiest, some of the lowest-hanging fruit at least. I think obviously documentation is so important to help people understand what's going on, and yet it's not necessarily a task that people enjoy doing, so we can automate that away. But yeah, I think there are obviously so many tools coming up. I know that there's one called TurinTech AI, and they refactor code and can go from code to code, whether that's Scala to Python. And, you know, they can let their AI essentially save the time that it would take engineers weeks and months to do it.
00:36:53
Speaker
I know in my day-to-day job, I speak to many engineers that have spent the last X amount of months refactoring code. So, you know, there are these tools which are going to be developing and improving and learning, which are going to come.

Integrating AI Tools: Challenges

00:37:06
Speaker
Yeah, it's great that you mentioned TurinTech, because I had a demo with them last month, actually, where I spoke to Mark from TurinTech. It's quite amazing. It's quite a brilliant use, a good advancement in where I see Gen AI can be utilized. Because this is also an interesting thing: a lot, I mean the majority, of individuals think that in order to implement Gen AI, we have to start from scratch and basically do everything anew from now on, right? But an organization has spent an insane amount of resources and effort to get to the place where we are.
00:37:45
Speaker
So simply ditching that and basically just using new things is not the most reliable. Yeah, it's not going to be. As I mentioned, what if tomorrow there's a new tool in the market and then you have to restart from scratch? That's not really nice, right? So refactoring and basically optimizing what you have done, and making it more AI-ready, so to say, is an interesting area. And I'm really impressed by what TurinTech is doing on that front. So yeah, things like that are something to look out for. I think you hit the nail on the head with this: these are the kind of places where I see a data engineer will definitely benefit and should leverage it, in terms of codebase refactoring, because migrations are painful and no one likes doing them. And if you could find ways to assist you in doing that, it would be pretty impressive.
00:38:32
Speaker
Yeah, 100%. I have been fairly impressed. And I know TurinTech; I spoke with Mark as well recently. And yeah, the clientele they've deployed their tool with, you know, we're talking about established banking organizations, some of the biggest organizations in the UK, which have got huge amounts of value from the tool. And I have been maybe skeptical, but Mark said that for their tool, their AI, they run it on premise, and then you don't lose any of your proprietary tech, which I know is another big factor within the AI space, isn't it? That is, I suppose, governance and, you know, data protection essentially. But
00:39:11
Speaker
maybe I can get a short piece of your view on that, but I'm sure, again, that's another whole episode in itself. That is. Code refactoring is one thing, and data governance and privacy is the next level altogether. I feel like that's the problem that a lot of us need to solve, and the overall narrative has been "we'll solve it when we get there" kind of thing, but I think we need to actively start thinking about, or at least implementing, those things, and how we can take care of or be mindful of governance and privacy. Yeah, so it's something to be very mindful of and just to work with as this grows, I think, is the key takeaway for the audience. I suppose, look, we've obviously mentioned quite a few tools already. I'll put a link to them all in the post and the podcast, because I'm sure people will be keen to understand and see what else is on the market. Have you got any other tools which you think people should be aware of, or should just have on their radar as potential future value for them and

Automating Hypothesis Testing

00:40:10
Speaker
their team? So I think we covered the ones on the analytics side of things. You mentioned TurinTech for code optimization, and that I think is really good from a data engineering point of view. One thing that I would add covers the third aspect of data, which is the data science side of things, right? Because I've spoken to quite a few people, and one of the tools that stands out to me is Infer, and basically they are
00:40:36
Speaker
doing similar things to what we talked about for data engineers, but for data scientists. So running hypothesis tests and experiments, and basically doing all the statistical analysis as a data scientist, is not an easy job and takes quite a lot of time. Finding the right models to test a hypothesis and find a solution is even harder. So what Infer is doing is basically helping data scientists reduce that barrier to entry, helping them pick up the data models through that boilerplate code, and basically enabling data scientists to do maybe three to four times more effective hypothesis testing and data science analysis than they could do before. So
00:41:19
Speaker
this is also leveraging AI to basically understand what the data is, understand what kind of analysis or algorithms would give a better indication of the success or failure of this experiment, and then leverage and use that. So by the time the data scientist starts actually looking into the results, the part that leads to the generation of the results is already automated for you, right? And I think there is a lot of value in that, when a data scientist can utilize their actual skills of analyzing and understanding those experiments, as opposed to just preparing a codebase that will help them analyze the experiment, right? So that's another tool that I feel has a lot of value. And it's a UK-based company, and it's really good stuff that they're doing here.
00:42:02
Speaker
Yeah, I know the team. From my understanding as well, it's essentially that they can give analysts the power of a data scientist. You can code in SQL to get far more advanced predictive understanding with just the use of SQL. So again, it lowers that bar of entry, which is, again, that enablement that we spoke about. Exactly. I agree. Yeah. So I think what I've taken away from this call is that we're not going to be replaced. Data professionals are not going to be replaced by AI, but instead need to learn to harness and maximize its ability and incorporate it into their day-to-day lives, to give them more time to focus on what matters.

Future of Generative AI

00:42:46
Speaker
I suppose final points from you, Gaurav, would be: where are we going with generative AI? You know, what is the future looking like? I don't think we can look five years ahead, but maybe six months to a year. Yeah. That's a good thing. I think I was talking to someone about this, and basically the TL;DR version of that is, I think the cat at this point in time is out of the bag. Everyone knows that there are huge advantages that Gen AI can offer on a day-to-day basis. There are huge implications. If we talk about it in six-month phases, take the first half and second half of 2023.
00:43:24
Speaker
The first half was like, wow, what is this thing? What can it do? Even though we know that AI has been in place for a while, OpenAI basically made it available for us to interact with, and made the barrier so low that anyone can use it, and created that awareness, right? The last six months have been very much around what it can do for us now on a day-to-day basis, right? And there are a lot of thought processes, ideas, and some early developments, some somewhat mature developments, happening, right? The next six months, in my opinion, is where you would actually see implementation on a much larger basis, right? So far, we have talked about how there are these companies that can do things. People are spinning up these organizations, as someone said, in their home office or something, which makes sense because the barrier to entry is pretty low right now.
00:44:11
Speaker
And therefore, you would see more people coming out with ways to solve problems in a more reliable manner. Right now the focus is on who is the first one to solve this problem. But as it evolves in the next six months, I see more and more organizations focusing on reliability and maturity, as opposed to just finding innovative ways to solve things, right? Because now the people that I speak to, the organization leaders that I speak to, are not just interested in knowing what it can do. The question would be: how can it do things reliably? That's the more important question that is going to be coming in. So for anyone who is listening and wants to understand where we see ourselves going, you would see
00:44:52
Speaker
equally big improvements in coming up with new ways to do things, but you will see organizations getting much more in-depth with how much they can leverage AI and how reliably they can leverage AI. So if you are someone who is developing an application for it, focus on that, because a lot of organizations will be asking these questions: what is the reliability like? What is the cost like? And so on and so forth, right? So moving from having something as an MVP to putting out a production-ready model or production-ready software is where you should be inching towards, and is what we will see now. More companies will come up with more production-ready, stable models or stable software to solve a business need. So that's where I feel like the rubber is hitting the road now, and you will see more of that in the next six months.
00:45:36
Speaker
Yeah, I agree. I feel like we're in that implementation phase. And then after that, we will inevitably iterate and improve and progress. But yeah, I think real-world applications at scale are on the horizon. Thank you so much for your time, Gaurav. It's been hugely insightful. I've loved the conversation. And yeah, I hope to do this again sometime in a year or 18 months, to see how far we've come since then. Absolutely. It's been a pleasure. Thank you very much for having me.
00:46:06
Speaker
Perfect. Thank you very much, everyone. See you next week. Bye-bye. Well, that's it for this week. Thank you so, so much for tuning in. I really hope you've learned something. I know I have. The Stacked Data Podcast aims to share real journeys and lessons that empower you and the entire community. Together, we aim to unlock new perspectives and overcome challenges in the ever-evolving landscape of modern data. Today's episode was brought to you by Cognify, the recruitment partner for modern data teams. If you've enjoyed today's episode, hit that follow button to stay updated with our latest releases. More importantly, if you believe this episode could benefit someone you know, please share it with them. We're always on the lookout for new guests who have inspiring stories and valuable lessons to share with our community.
00:46:55
Speaker
If you or someone you know fits that bill, please don't hesitate to reach out. I've been Harry Gollop from Cognify, your host and guide on this data-driven journey. Until next time, over and out.