
001 - The Power of The Modern Data Stack

E1 · Stacked Data Podcast

I'm joined by Sami Rahman as we dive deep into the fascinating world of Modern Data, covering topics around maximising your stack and the power of a semantic layer.

In this episode, I had the privilege of sitting down with Sami, who has a truly unique path into data. Together, we explore the intricacies of the semantic layer and modern data stack, unearthing its transformative potential and its vital role in modern data architecture.

๐ŸŒ What You'll Learn:

An honest insight into Sami's unique career journey (including a stint in COUNTER-TERRORISM)

Insights on harnessing the power of the modern data stack and a semantic layer to unlock data's true potential.

Real-world applications and success stories that shed light on its impact.

Practical tips and strategies to navigate the challenges of implementing a robust semantic layer.

Top tips for interviewing and a successful career in data!

This podcast isn't just about technology; it's about the stories, experiences, and lessons that drive innovation in the data landscape. Whether you're a seasoned data professional or simply curious about the future of data, this conversation offers a wealth of knowledge and inspiration.

Please give us a follow as we have lots more episodes coming!

We are always looking for feedback or topics you'd like to hear about. Please reach out!

Transcript

Introduction to Stacked Podcast

00:00:02
Speaker
Hello and welcome to the Stacked podcast, brought to you by Cognify, the recruitment partner for modern data teams, hosted by me, Harry Golop. Stacked with incredible content from the most influential and successful data teams, we interview industry experts who share their invaluable journeys, groundbreaking projects, and most importantly, their key learnings.

Interview with Sami Rahman

00:00:26
Speaker
So get ready to join us as we uncover the dynamic world of modern data.
00:00:35
Speaker
Hello everyone and welcome to the first episode. Today I'm joined by Sami Rahman. He's the head of data platforms at Penguin. Penguin are the UK's largest publisher. You've probably read some of their books. Penguin work hard to stretch the definition of the word publisher and they make books for everyone because they believe a book can change anyone.
00:00:55
Speaker
Today, Sami and I talk about the power of the modern data stack and how Penguin is maximising the tools it has to drive business value. Sami also uncovers how they implemented a semantic layer, the key challenges and lessons, and what he'd do differently if he could do it all again. Sami's journey into data is incredibly unique. He's open and honest about his thoughts on the modern data world and shares some incredibly valuable lessons. I hope you enjoy our conversation.
00:01:25
Speaker
Hi Sami, welcome to the show. Thanks for joining me today. How are you doing? Good, how are you?

Sami's Role and Background

00:01:30
Speaker
Yeah, very good. It's Friday afternoon and we're in the Penguin offices and really excited to dive into the power of the modern data stack. But first off, for the audience, it'd be great if you could give us a bit of a background to yourself, how you've come to Penguin and what your role is at Penguin.
00:01:44
Speaker
Sure, so I'm Sami Rahman. I'm currently the head of data engineering and data platform at Penguin Books. But yeah, I studied psychology at uni, and I worked as a psychologist at a tech company, where I started seeing all this data and analytics stuff. We did a lot of statistics in psych, but I started really appreciating it there.
00:02:03
Speaker
Then I did a master's in counter-terrorism at UCL, and I did really short, semi-paid volunteering stints in counter-terrorism with private intelligence. I mostly focused on ISIS, al-Qaeda, Boko Haram, the Moro Islamic Liberation Front, basically jihadist terrorists.
00:02:21
Speaker
It was so goddamn boring. I've never been so bored in my life. Then I finished uni, became an MI5 reject, I didn't get in, and I literally spent eight months unemployed. I was learning Python on Udemy, along with SQL and GCP, and building a Kaggle portfolio to show what I could do in data science.
00:02:41
Speaker
After 326 rejections, I got my first job as a data scientist at a WPP agency called Essence, "the Google agency"; Google were our main client and we did a lot of training there. I went over to HSBC as a senior data scientist after a year and a half. After a week, I got promoted to product machine learning manager. After three months, I got promoted again; my job title was AVP, fraud data and machine learning services manager.
00:03:06
Speaker
I was in charge of data science for card fraud in the UK. I had about 40 or 50 people there. I did that for a year, but I was at HSBC for a year and a half. Then I left and joined PwC in the deals division, looking at how we merge technology and people across various clients. Sounds awesome, but all I did was PowerPoints, and I got headhunted by Penguin in my second month. And yeah, I was told by my boss,
00:03:34
Speaker
that Penguin needed a lot of help, and I thought, this is going to be fun. So I joined and helped build it out. My role here focuses on not only building out the data platform, which is mostly done, but also on what future data capabilities the company needs. How will the market change?
00:03:50
Speaker
And what will we need, technologically and people-wise, in our team, but also in the rest of the company, to upskill and keep us a modern data company and a data-literate company as well. But yeah, that's me in a nutshell. I will be leaving in a month to be the director of data at a global fashion company called Hypebeast.
00:04:09
Speaker
Well, it sounds like a fascinating journey into data, and it goes to show that if you put in the work, really try to understand what you're doing, and put yourself out there, you can make anything possible. Today we're obviously going to be talking about the power of the modern data stack, and it sounds like there was a lot of work to do at Penguin, so let's dive in from there.

Modern Data Stack at Penguin

00:04:31
Speaker
Sami, for the audience it'd be really good if you could break down what Penguin's stack is and how you guys utilise it. Sure, so we have quite a few tools. We're strangely quite mature compared to most places, but the history behind that is
00:04:48
Speaker
We didn't actually use the cloud until two years ago. We had physical on-prem servers from IBM and Oracle; I don't know what that tech is, it's way before my time. Two years ago, we decided to move to the cloud, but we decided to have a modern data stack. So we do have AWS and Azure, but we don't really use the components within them too much. Our primary storage and compute is Snowflake.
00:05:12
Speaker
Our integration and ingestion tool is Fivetran. Our transformation tool is dbt. We use Airflow for orchestration, but we're going to get rid of it in favour of Dagster; we can explore why I'm going to do that a bit later. We currently use Elementary for our data quality monitoring and testing. Now, that's kind of the core of the platform, where the nerdy engineering stuff happens that no one cares about.
00:05:36
Speaker
But then there are the sub-platforms, or microservices, however you want to call them. For our visualisation stack, we use Power BI in conjunction with ALM Toolkit and Tabular Editor to make that system work better. We are looking at Fabric, which I imagine most people are as well, and there are definitely pros and cons to that.
00:05:55
Speaker
For our data governance tool, we use data.world, which is a medium-sized player that some people have heard of. For our machine learning platform, our MLOps platform, we use tools most people haven't heard of. For our ML platform, we use Valohai, a small Finnish player.
00:06:12
Speaker
We use Feast, the open source feature store associated with Tecton, and we use Evidently AI for monitoring.

Semantic Layer Implementation

00:06:17
Speaker
We are also building out a low code, no code analytics platform, but that's fairly straightforward because we're using AWS SageMaker Canvas for that.
00:06:27
Speaker
Amazing, that's a really good summary. And how long have these technologies been in the Penguin ecosystem? Is it relatively new? So Snowflake, Fivetran, dbt and Power BI have been around for two years. Data.world has been around for one year, eleven months, which is about when I joined. So I inherited those things, which was quite funny, because as I joined Penguin, I didn't believe in the modern data stack. I thought it was bullshit. Like, why don't you just use everything on
00:06:57
Speaker
GCP or AWS. But then I really started seeing the power of it and why you would actually buy third party tools on top of your cloud platform. And then
00:07:08
Speaker
I started seeing where the problems are and how we could be better, and that's how we got Elementary and Airflow in our core, ALM Toolkit and Tabular Editor in our viz stack, and then things like Valohai, Evidently, and Feast for the MLOps platform, and Canvas as our low-code, no-code platform as well.
00:07:30
Speaker
Amazing, amazing. So one of the things we've spoken about previously, and I'm really keen to dive into, is your work around the semantic layer. It's obviously a really new topic, and I'm keen to understand, one, what your definition is, to help explain that to the audience, and two, talk us through how you leverage your semantic layer. Sure. It was such a big thing last year; it's become quite quiet now with generative AI. But the fundamental short definition of a semantic layer
00:08:00
Speaker
is a single source of truth where the people in your business access the data they need. Now, where everyone gets really confused is that there are kind of four implementations of a semantic layer. Two of them are a semantic model, which most companies will do. One is your business intelligence semantic model, so that's in your Power BI or Tableau. That's the layer everyone's accessing. But there are a lot of problems with that, and I'll go over what those problems are.
00:08:26
Speaker
Once upon a time, all data teams were concerned about was BI, but we're doing machine learning now. We're doing loads of different data capabilities where we have different tools. You don't want all your tools pointing at your BI solution, because that's going to overload its compute; it's not the most appropriate way to have your stack connected together. But that is traditionally where most people would have a semantic model, and there are problems with it. The next semantic model
00:08:52
Speaker
is in your data warehouse. Most places have a data lake or data warehouse structure, but that's all in physical models that only the engineers understand. So this is probably the most piss-easy thing most companies can do: make another account on Snowflake, BigQuery, whatever you use,
00:09:08
Speaker
and use that as your business-friendly level. We do the first two, but there are problems with them, and that's when we started looking at semantic layers as well. The third is a universal semantic layer, where you have a specific modern data stack tool, like Lumix or AtScale; I can't remember any others that actually do that for you.
00:09:27
Speaker
And the fourth one, I believe, is an in-stack semantic layer, so if you use dbt for it or something like that. The last two versions I talked about, the universal one and the in-stack one, are probably the best ones. Because in the first two examples, you still have your engineers sifting and trying to figure out how things fit together, whereas the last two really enforce data governance.
00:09:50
Speaker
So a really good example: take three departments in your company, sales, marketing, and finance. If you are a B2B or B2C company, sales might call a particular data point "customer", marketing might call it "prospect", and finance might call it "party", but it's all the same thing. And in the semantic model examples I gave earlier, in your BI tool or in your warehouse,
00:10:16
Speaker
you'd probably have three separate tables for them, and you might not join them. But in your semantic layer, you'd actually have one universal definition across your entire organisation, which is where the term "universal semantic layer" comes from. You'd say, all right, we're going to call this person "customer", and we know when sales calls it customer, or marketing calls it prospect, or legal and finance call it party, we're all referring to this one thing. And the beauty of this is it actually makes
00:10:44
Speaker
how all your tools connect to your data a lot easier. Because now you're connecting to your semantic layer: your BI tool is looking at your semantic layer, your governance tools, your analytics tools, your machine learning tools. It's all just looking at the unified business language, and you're doing analysis from there. Because the real problem of not having one is, you know, whatever data product you build, you're going to get feedback and changes from your end users.
00:11:07
Speaker
And that's where the data, I call it data divergence, I don't know if that's a real term, comes in. If you take a very common thing nearly every company will do, which is mapping the landscape of digital marketing, it might actually diverge in your BI stack versus your machine learning stack versus your geospatial stack, stuff like that. But with that universal semantic layer, because it forces everyone to look at one point that's connected to governance and quality monitoring,
00:11:35
Speaker
it's the same everywhere. So if somebody requests a change, if someone looks at a BI report or a machine learning model and says, hey, can you create this new variable or rename it, it happens at that universal semantic layer, and because it flows everywhere from there, there's no divergence.
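As a hypothetical illustration of the idea above: the heart of a universal semantic layer can be thought of as one shared mapping from each department's local term to a single canonical business definition, applied before any tool-specific modelling happens. All names here are invented, not Penguin's actual model.

```python
# A minimal sketch of "one universal definition across the organisation".
# Every tool resolves its department-local vocabulary through this mapping,
# so a rename or new definition happens in exactly one place.

CANONICAL_TERMS = {
    "customer": "customer",  # what sales calls it
    "prospect": "customer",  # what marketing calls it
    "party": "customer",     # what finance and legal call it
}

def canonicalise(term: str) -> str:
    """Resolve a department-local term to the organisation-wide definition."""
    return CANONICAL_TERMS[term.lower()]
```

In a real universal semantic layer tool this mapping also carries types, lineage, and access rules, but the single-point-of-change property is the same.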
00:11:51
Speaker
So, as you said, it's a single source of truth, where you have universal consensus and everyone agreeing on a single definition. Yeah. And even if everyone doesn't agree, because it happens in that one tool, that one place, it gets filled in everywhere. dbt has been really hot on it, even though everyone's just talking about generative AI right now. It's still not fully fleshed out in dbt, but
00:12:19
Speaker
most places use dbt, so that's probably the path of least resistance, and I think these tools are getting more popular. The challenge most places will have is that they're not cheap. So I think the thing to look at is whether you can have a multi-faceted semantic layer. dbt isn't a pure semantic layer; it's a transformation tool and a semantic layer. Lumix isn't just a semantic layer; it's a semantic layer and an enterprise data catalog.
00:12:44
Speaker
AtScale is a semantic layer first, but also a feature store. So when you can kill two birds with one stone, you can justify that cost a lot easier. I think there are other companies that do similar things, like Denodo and so on. That's really interesting, and I suppose that's a very nice segue into: what were the steps you took to implement a project like this? So the first thing we looked at is
00:13:09
Speaker
We kind of realised, because I'm in charge of data engineering and data platform here, that we have the architects on our team who decide what it looks like in the warehouse, but then our BI people have data modellers who decide what it should look like in DAX before it's in Power BI. And sometimes a user would say, hey, why is the data like this? It shouldn't be like this; it should be like how it is in the source system.
00:13:35
Speaker
And then, you know, we'd look at the warehouse and go, we haven't done that, where's that come from? And then realise that's where the divergence happened. Or it's not the same in our governance catalog, and we have a governance team here. So that's when you realise, ah, actually, if we had one bloody thing that made it the same everywhere, we'd have fewer problems.
00:13:54
Speaker
So the first thing we did was ask: do we have anything on our cloud platforms or in our modern data stack that we can just use straight up? So we played around with the dbt semantic layer; we were on the private preview for it. And the challenging thing is we're already quite fleshed out, so we were like, oh, we'd have to implement this everywhere, and that's quite long. So we picked really high-value use cases where there are loads of changes and requests against the data.
00:14:21
Speaker
So we tried it on a really small scale there. But then we started looking at actually getting a universal semantic layer tool specifically for this. We looked at AtScale and another one whose name escapes me, but they were quite expensive. But those are the kind of steps we took.
00:14:37
Speaker
The main thing is to ask if you need it, because if your data stack is literally a warehouse and BI, you probably wouldn't benefit from it hugely. If your modern data stack is warehouses, BI, governance, machine learning, and more automated systems like synthetic data generation, because you have all those multiple sub-platforms, that's where you really benefit from it.
00:15:00
Speaker
Then the other step we took was asking, can we do this at a small scale and see the impact of it? What's the value generated back to us, what are we actually getting, and how hard is it to implement? Because some of these things sound great in a short definition, great on a website or on paper, but look at the implementation: if it takes you a year but doesn't save you two or three years, it's probably not worth it.
00:15:25
Speaker
Yeah, yeah. So that's really interesting. So the need for a semantic layer is driven by the complexity of your stack and your ecosystem. Yeah, 100%. And what value does it then drive? You know, when you have this complexity, what is the specific value you've seen at Penguin from what you've built? So in these small ones, originally we were just thinking about metrics the business cared about. But actually having
00:15:53
Speaker
it, and we do use dbt's one, made us think much further about what other metrics and products we can create from this. A great benefit is data lineage, which, I won't lie, the data team probably uses more than everyone else. But if you're in the business and you're asking where this data came from, you can actually see what sources it came from, how it was merged together, and what the recipe makeup of the data is, which,
00:16:21
Speaker
when you're really deep diving, is very useful. The other aspect is that with the universal semantic layer, you can get really good metadata and metadata profiling. What's the normal structure of this data? How often should it come in? What's the materiality of it? What's the risk of it?
00:16:39
Speaker
And that actually helps the data team prioritise: if this thing broke, how quickly should we fix it? How important is this piece of data to the business? How much is it used and consumed across everything? Because without one, you could only really see within each separate tool how much it was used. But when it's all universally defined and merged, you can see, for example, that Amazon is quite important to us.
00:17:06
Speaker
You can actually see that this Amazon dataset is consumed in every part of our data platform at such a massive volume that, as a data asset, and this is a big thing a lot of leaders talk about all the time, what is the value of a data asset, you can actually get some of the metrics to try to define that. Interesting, and I suppose that's what the executives are looking for: that definition, something tangible which they can measure.
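As a loose sketch of the asset-value idea just described: once every tool reads through one semantic layer, consumption can be counted in a single place, giving a rough usage signal per data asset. The asset and tool names below are invented.

```python
from collections import Counter

# A toy consumption log, as a semantic layer might record it:
# each entry is (asset, consuming_tool).

def asset_usage(access_log):
    """Count how often each data asset is consumed across all tools."""
    return Counter(asset for asset, _tool in access_log)
```

A high count across many tools (the Amazon-feed example above) is one concrete input to the "value of a data asset" conversation.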
00:17:30
Speaker
It's a really interesting product, and it's clearly driving value. What were the pain points? When you were implementing this, how did you get to where you are, and what sort of solutions did you come up with for those pain points? Some of the pain points initially were around understanding, because on paper it's quite straightforward. But also on paper, the question you get asked back is, why can't I just spin up a data mart or an OLAP cube or a separate warehouse and do it?
00:17:55
Speaker
But those things aren't always super flexible or integratable with things downstream. A really great example we came across: we had a semantic model in Power BI. So the question we were getting back from the business, from the team, was, why do we need to do this again? Why can't you just connect everything to Power BI? But the problem you have there is that your machine learning systems, all these downstream systems,
00:18:20
Speaker
are now really impacting the performance of something like Power BI, and you're using it for a purpose it isn't meant for. But then the governance team actually asked the same question: hey, we've defined all these things as catalog items and glossaries inside our data governance tool, so shouldn't this be the semantic layer? It's just like,
00:18:36
Speaker
then you have to explain to them why an enterprise catalog is not a semantic layer. There are tools that are both things in one, like Cube and Lumix, but unless you're using one of them, it's not really appropriate, and you're overloading a system that's not designed to pass through, monitor, and combine data in that particular way. But the way we got through the first part was education,
00:18:58
Speaker
because it's not super obvious right away what it is. The next pain point was: do we do this on everything, or just a bit of it? Even the way you program it in dbt, defining a semantic layer isn't always straightforward. The other really hard thing that I think a lot of people come across is that a semantic layer isn't necessarily integratable with everything. This is especially true of things on cloud platforms.
00:19:25
Speaker
So everything worked with data.world and Valohai, but our dbt semantic layer didn't work with Power BI, which is our most consumed tool. And I still remember having a meeting with Microsoft saying, hey, I want to do this semantic layer thing, and dbt are willing to have an integration with you; are you willing to have an integration with them? What they said back to me was, oh, what's your data stack? I was like, yeah,
00:19:50
Speaker
exactly what I said: Snowflake, dbt, all that. They said, oh yeah, you should just get rid of it all and use Azure for everything. I'd never been so furious in my life. I was like, you want me to get rid of 13 tools for the sake of your one tool? It's probably easier for me to just get rid of your tool and go with ThoughtSpot or Looker. And they kind of pissed themselves at that moment, and then they got in there,
00:20:11
Speaker
should I be admitting this? They got their Power BI product team from Germany and America, and dbt brought theirs, and they've started a relationship to integrate them together. But that's probably two years away, and this was a year ago. So to fix that pain point, we have the dbt semantic layer going into
00:20:29
Speaker
an Azure server, and that is connected to Power Query, and then Power BI is looking at that. That's one thing to consider, actually. It was the same with AtScale as well. It's a great idea on paper, but it doesn't necessarily mean the thing you're picking integrates with everything. So that is one thing you need to look at. Just because you vibe with the company and you get what they do,
00:20:51
Speaker
and you love the UI and everything about it, doesn't necessarily mean it works with your stack. So checking integrations, planning ahead, and making sure everything is going to work together is important.
00:21:03
Speaker
That was literally my mistake, because with the whole definition of the semantic layer stuff, you're like, oh yeah, it should work with everything. It does, but it doesn't. Okay. Well, that's interesting, and I'm sure that's hopefully helped some people listening. And I suppose on that, if someone was just starting this project in their own company, what would be your biggest piece of advice as they embark on a project like this?
00:21:24
Speaker
Great question. I think the piece of advice I'd give is that in isolation, a semantic layer tool is super expensive, and I think it will save them a lot of time. But where they could be more effective is actually picking one that does more than one

Managing the Modern Data Stack

00:21:42
Speaker
thing. And I know that's not the cool thing to say in the modern data stack. But if you get a semantic layer that is also a transformation tool, an enterprise data catalog,
00:21:52
Speaker
or a feature store, then they can have multiple uses for it. And it becomes much easier to justify to finance and your bosses why you're getting it, because it does so many more things. And actually, some of those combinations I mentioned, semantic layer and transformation, semantic layer and catalog, semantic layer and feature store,
00:22:11
Speaker
just make sense, and actually become much more powerful when they're together. Okay, perfect. One of the things that's become clear from talking is that the modern data stack has a lot of tools, and the more tools you get, the greater the complexity and the harder they are to manage. How do you guys ensure you've got observability and management over your stack and the data within it? There are kind of two separate things here: data observability and infrastructure observability.
00:22:37
Speaker
For data observability we use Elementary, which is amazing; I can't believe it's free. It's literally changed the game for us on our data quality and how well things operate. There's one unique challenge: Elementary only integrates with dbt. Now, the reason that is a pro and a con is
00:22:55
Speaker
that for any transformations you're doing, you're finding out where the problems are, the anomalies, the changes, if the timing is wrong. But dbt doesn't run on everything. So for example, when you're getting raw data in and dbt is only extracting it, or when you're sending it from dbt to your viz stack, your governance stack, your machine learning stack, and anything else you might have,
00:23:16
Speaker
dbt isn't really doing anything there, so Elementary is also doing nothing. So we're actually currently in the process of looking at an enterprise data observability platform. We're looking at Soda, Monte Carlo, and Sifflet, but we're still in the process of picking one, so I can't say which way I'm leaning. It seems like a nice-to-have, and I honestly believed these were nice-to-haves for a very long time, but it is a massive game changer for a team's time and their quality of life.
00:23:44
Speaker
And a lot of these things, even if you want to go free in your team with Elementary or re_data, or Great Expectations, which is the most popular one, you will see massive changes to how you operate with the company, because they trust the data. The same goes for infrastructure observability. You can do something funky, which is use Airflow, Dagster, or even Kinesis on AWS to have event receivers and event producers just monitoring the pipeline execution in each tool.
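A minimal, in-memory sketch of that event producer/receiver pattern: each tool emits one event per pipeline step, and a receiver checks the whole flow completed. In practice a queue such as Kinesis would carry the events; the pipeline and step names below are invented.

```python
from collections import defaultdict

class PipelineMonitor:
    """Aggregates success/failure events emitted by each tool in the stack."""

    def __init__(self):
        self._events = defaultdict(list)

    def emit(self, pipeline: str, step: str, status: str) -> None:
        # Each tool (ingestion, transformation, viz refresh, ...) produces
        # one event per run; here we just collect them in memory.
        self._events[pipeline].append((step, status))

    def flow_succeeded(self, pipeline: str, expected_steps) -> bool:
        """True once every expected step has reported success."""
        done = {s for s, status in self._events[pipeline] if status == "success"}
        return done.issuperset(expected_steps)
```

The point of the pattern is that flow-level success is checked in one place, even though each step runs in a different tool.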
00:24:13
Speaker
That's a sneaky way to make sure the data flows actually worked and everything ran. But we do use the AWS vanilla stuff, like CloudWatch and X-Ray, to make sure all the machines are working how they should. So one part is data observability, the other part is infrastructure observability. And do you have any sort of metrics, or how do you assess how effective those tools have been? You mentioned, obviously, that it's making your team's lives easier; did you measure that?
00:24:41
Speaker
Yeah, so one aspect, and I'll start with the boring ones before the more specific ones, was how many complaints we got. We used to honestly get about 40 to 50 a week, because a lot of the systems we're connected to, and this is so industry-specific to us, are literally 60-year-old systems from
00:25:02
Speaker
some other book companies which people are familiar with. We were getting 40 to 60 complaints a week from people saying, this is wrong, it should be this, it should be structured like this, why are there Chinese characters in this. We're very happy to say we get three a month now,
00:25:17
Speaker
which is massive. We always had a floating incident count of around 80 to 120. It's literally four now, and those four have nothing to do with data quality, so it's almost zero. Amazing. So your team has more time to focus on development, new developments, new tools, rather than... Yeah, rather than debugging and doing
00:25:39
Speaker
incident management, they're literally doing analytics engineering, MLOps engineering, automation tests; they're having a lot more fun. So am I, because fewer people are annoyed with us all the time. Driving more value as well, by the sounds of it. Yeah, actually focusing on real work rather than fixing mundane things. But then there are very specific metrics. One was freshness.
00:26:00
Speaker
So, we know when certain data must come in. How fresh is it? Does it come in every hour, once a day, once a week? Schema changes: has the structure stayed the same? Anomaly detection: is there anything really out of whack for how this data should look? And there were dbt tests as well, so we tracked those too: how many unit test errors and system test errors occurred?
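A tiny sketch of a freshness check like the ones just described, assuming we know each source's expected arrival cadence; tools like Elementary implement this kind of test against warehouse metadata, but the core comparison is just this:

```python
from datetime import datetime, timedelta, timezone

def is_fresh(last_loaded_at: datetime, cadence: timedelta, now: datetime) -> bool:
    """True if the newest load is within the expected arrival cadence."""
    return (now - last_loaded_at) <= cadence
```

Schema-change and anomaly tests follow the same shape: compare the latest observation against a stored expectation and alert on the difference.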
00:26:24
Speaker
So there are a lot more metrics now than we ever had before, because that was a question I used to get: can we trust the data? There are so many mistakes. And we're like, oh, here, actually, look at this dashboard from Elementary, nothing's been detected. Obviously, it's not 100% foolproof.
00:26:41
Speaker
But the people in the business are much happier with the reliability of the data. And that's so, so important. Trust is hard to win and easy to lose. And there's no point in building all of these cool data products and dashboards if the business isn't going to trust them, right?
00:26:58
Speaker
Amazing. So, as Penguin, I think it's fair to say you're quite a creative business, and there are quite a lot of creative profiles within the business, rather than the typical technical profiles you might expect at more tech-based businesses. The modern data stack has lowered the complexity of some tasks; how have you guys taken advantage of this?
00:27:20
Speaker
Our favourite, and the team said they'd kill me if I ever tried to get rid of it, is probably Fivetran. Even when I was more hands-on, I still remember what a pain it was back in the day writing a Python script with curl and bash and making sure it worked to pull an API. While it's relatively straightforward, it's tedious, and when it breaks it's annoying because they've updated the API. But now,
00:27:43
Speaker
if we want to connect to Facebook or Amazon or wherever the hell it is, we literally put in the credentials and it's up and running in two minutes. Whereas writing API scripts: give it to a junior person, it takes them a day; give it to someone a bit more senior, they do it in an hour or two. Now it's just not something we think about anymore. Same with dbt, some of the stuff that happens in dbt, how it looks at seeds and sources and CTEs.
00:28:07
Speaker
It used to be a very long process of writing stored procedures, and it was so specific to the individual writing it. But now, all of those DAGs in dbt, the way we write them, the style guide of it, how you structure it, and how it runs automatically, it's much nicer. Because it fits in with Git, the entire team's reviewing it now, whereas those stored procedures never were before.
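The "runs automatically" part comes down to a topological sort of the model DAG: dependencies always execute before the models that use them. A rough illustration of that idea, with made-up model names (dbt's real scheduler is far more involved):

```python
from graphlib import TopologicalSorter

# each model lists the models it depends on (raw sources depend on nothing)
dag = {
    "stg_orders": {"raw_orders"},
    "stg_books":  {"raw_books"},
    "fct_sales":  {"stg_orders", "stg_books"},
    "raw_orders": set(),
    "raw_books":  set(),
}

run_order = list(TopologicalSorter(dag).static_order())
print(run_order)  # sources first, staging next, facts last
```

Hand-written stored procedures encode this ordering implicitly in one person's head; declaring the dependencies and deriving the order is what makes the pipeline reviewable by the whole team.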
00:28:31
Speaker
But also our data governance tool is looking at it, always feeding back the lineage of all of this from dbt's lineage, and the metadata profiles. Now there's a lot more visibility and transparency on it. Whilst it's also a lot quicker, it's got much better survivorship than it ever did before as well.
00:28:48
Speaker
And on the creative profile side, when we first came around to the business, it was like, why do we need this? I know what I need to sell. I know what's the hot thing in our world. But once we increased that trust in data, the speed of data, the reliability of data,
00:29:04
Speaker
and actually the accessibility of data, which is probably the most important thing that I never talked about, we started helping people learn: actually, this is how you read reports, this is how you build your own reports, this is how you do your own SQL. Now there are a lot more people making data-informed decisions, checking whether their gut feeling was right or wrong. So that's been quite nice as well.
00:29:24
Speaker
Nice. That's what everyone's chasing, right? Data-driven decision making. I'd say the biggest thing about the modern data stack is developer experience. This has been true of all the tools we've seen: they've just taken away the things you'd have to deal with if you were to do it on a pure cloud platform, a cloud-native platform.
00:29:43
Speaker
There's just less configuration, less infrastructure management, a lot less network coding and stuff like that. While all of that is fascinating, and I've got mad respect for people who are really into it, for a lot of data people, regardless of whether you're a scientist, engineer, analyst, governance or strategy person, it's just not as interesting. They're more interested in the data side of things rather than the compute side of things.
00:30:07
Speaker
So I think that's where the modern data stack really empowers all of us: to focus on data rather than computing. And the data is what should be driving the value for the business.

Future of AI in Data

00:30:19
Speaker
So one of the other things we're keen to talk about is your low-code, no-code solutions within Penguin. So yeah, I'm keen for the audience to hear what you guys have been doing in that space. One is Power BI if you consider it low-code and no-code. So instead of people having to
00:30:35
Speaker
Which is what happened, apparently, in the past. There would be people writing SAP BusinessObjects code or Python code, making graphs for people and sending them out in PowerPoints. But now, with Power BI, you just have the datasets loaded and you're clicking and dragging and dropping what you want it to do. There is an element of coding in there, and there is a learning curve for Power BI. The thing we've gotten very recently is Amazon SageMaker Canvas.
00:30:58
Speaker
So that is an advanced analytics and machine learning platform where we don't need to know how to code at all. I like to think of it as Amazon's take on, what's that thing that's super popular with finance people, Alteryx. Amazon's take on Alteryx. And that's been really helpful, because there are people here
00:31:14
Speaker
who understand the concept of formulas, algebraic formulas. They understand the concept of forecasting, the concept of clustering, how we group these things together. But they don't necessarily understand the code behind it. They don't necessarily understand
00:31:31
Speaker
how to code it up. They don't have the execution. But our low-code, no-code solutions actually allow them to do that, find what they need to, and go off and do it, which has been quite empowering. And we're seeing a lot more interesting stuff. And the other aspect that's happening quite a lot in the industry, although we're choosing not to use any of it right now until it stabilises,
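As a flavour of what a no-code tool is doing under the hood when it "groups these things together", here's a tiny one-dimensional k-means sketch in plain Python. The sales figures and starting centres are toy values invented for illustration; real tools handle many dimensions and choose the number of clusters for you:

```python
def kmeans_1d(points, centers, iters=10):
    """Tiny 1-D k-means: assign each point to its nearest centre,
    then move each centre to the mean of its assigned points."""
    for _ in range(iters):
        groups = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            groups[nearest].append(p)
        centers = [sum(g) / len(g) if g else c for c, g in groups.items()]
    return sorted(centers)

# weekly sales of two rough "kinds" of title: slow sellers and hits
sales = [10, 12, 11, 95, 101, 99]
print(kmeans_1d(sales, centers=[0, 50]))  # two centres, near 11 and 98.3
```

The business user's mental model ("these titles behave alike") is exactly this, minus the code; the no-code layer just hides the loop.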
00:31:53
Speaker
is that nearly everyone in the data stack has released some generative AI LLM into their tool. Snowflake is talking about it, data.world has already done it, Microsoft Azure and Power BI have already done it. And why that's powerful is that a lot of people hate SQL. I'm not going to lie, I'm also one of those people who's just not that good at SQL.
00:32:12
Speaker
But if you can type natural language questions, say, how many sales were there last week, or what's the likelihood we're going to sell this many romance books, and it writes the SQL for you, or the Python code for you, and gives you the results back, that is quite powerful for data literacy.
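The pattern these vendors are shipping can be sketched as: natural-language question in, SQL out, results back. Here the "translator" is a hard-coded stub standing in for the LLM call, and the table, question and mapping are invented for illustration:

```python
import sqlite3

def translate(question):
    """Stub for the LLM step: map a natural-language question to SQL.
    A real tool sends the question plus the warehouse schema to a model."""
    canned = {
        "how many sales last week?":
            "SELECT COUNT(*) FROM sales WHERE week = 'last'",
    }
    return canned[question.lower()]

def ask(conn, question):
    sql = translate(question)               # NL -> SQL
    return conn.execute(sql).fetchone()[0]  # run it, hand back the answer

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, week TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, "last"), (2, "last"), (3, "this")])
print(ask(conn, "How many sales last week?"))  # 2
```

Swapping the stub for a model call is the whole product; the hard parts in practice are schema context, ambiguous questions, and trusting the generated SQL.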
00:32:29
Speaker
The player who's done this the longest, who we don't use, who I absolutely adore, is ThoughtSpot. They don't believe in the whole dashboard thing. They've always said: ask a natural language question and get the query back. Because most of the business are humans with questions; they don't necessarily, exactly as you said earlier, have the execution to go answer those questions. While we're not doing it here right now, I reckon in two to four years it'll just be commonplace across the entire industry. I think coders will always exist, engineers will always exist, but that
00:32:59
Speaker
interface from a normal business user to the modern data stack will be a lot easier and much closer. So when I'm looking at tools now, it doesn't matter what it is, I actually ask them: that's cool, you've shown me the engineer's view, what's the business view of this? And all of them say, oh yeah, we're making our bot now, blah, blah, blah, it'll be out whenever.
00:33:17
Speaker
It's so true. I mean, I've noticed, particularly in the last year, there's been a much harder focus on data teams really driving business value, answering business questions and upskilling in domain knowledge rather than just their technical data knowledge. So being able to ask a question and get an answer back, I think, is going to be really powerful
00:33:39
Speaker
moving forward.

Sami's Reflections and Advice

00:33:40
Speaker
So, Sami, we're getting closer to the end of the show. I think it'd be nice to summarise some of the biggest lessons you've had on your journey at Penguin since you came in, and what you think people would value in those lessons, whether that's about your semantic layer or about any other sort of projects that you guys have run.
00:33:59
Speaker
Yeah, I've got some great lessons, but I don't think they fully tie into anything we've said. Number one is don't be a fanboy. I've seen so many mistakes where people are absolutely attached to a tool. It could be Snowflake, it could be Amazon, it could be Google, and they're kind of blinded by their love of that tool and what it can do. They're not willing to let go. So my entire team,
00:34:22
Speaker
we discussed it and decided our mindset: we're not loyal to any of our tools. While we might like the people there, we're not loyal to any of our tools. Once upon a time, people never thought Oracle and IBM were going to fall. Then Amazon, Google, and Microsoft came around and kind of took them out, right? And now all this modern data stack stuff is taking out those guys. So we are not attached to our tools. And you actually see this in a lot of engineers and scientists: they're so attached to their tools.
00:34:50
Speaker
We don't care if SQL goes out of date or Python goes out of date. We'd be more than happy, if Starbucks launched a cloud platform and it blew Amazon and Azure out of the water, to go towards them. We have no problem getting rid of Snowflake, dbt, Fivetran or any tool that I mentioned today; we'll go with the tool that's best for the solution. Yeah, that's the mindset we believe in. But the other part of it is how easy it is to move, because Fabric is quite a nice thing from Azure, but the opportunity cost of implementation is
00:35:19
Speaker
too big to make it worth it. We have succession plans for every tool in our stack. Obviously, the hardest are AWS and Azure. I think the other aspect is: do not get blinded by marketing. I'm definitely not mentioning names, but
00:35:34
Speaker
there are definitely modern data stack tools with very impressive marketing, very impressive marketing budgets. And I see that a lot of the time. I see people talking about their modern data stack platform, and I see the diagram or PowerPoint, whatever they have, and I look at it and think: you've kind of picked everything that has the most marketing in the industry, as opposed to what's best.
00:35:58
Speaker
If we did have that mindset, we never would have had data.world as our enterprise data catalog. We never would have had Valohai as our MLOps platform. So I'd say those are the two main lessons. Don't be attached to your tools, and really do your research. Be objective.
00:36:16
Speaker
Brilliant. Well, this brings us to the final part of the show. It's a part aimed at hopefully educating people who are aspiring in their data careers, and hopefully you can give some advice to help them. So we ask all the guests the same questions. The first one is: how do you assess a job opportunity, and what excites you about one? It's a good question, especially as I'm leaving Penguin in a month. So what I look at,
00:36:42
Speaker
and this is going to sound so boring, is the financial reports of the company, if they're available. Because that tells me: is this company growing, are they trying to do bigger, better things? If a company's fairly stagnant, then I know I'm just there to keep the wheels turning. I'm basically WD-40. But if I see that they're growing, getting bigger and investing in things in their financial reports, then
00:37:04
Speaker
there's probably going to be investment in my area. That was something I picked up at HSBC actually, how to read those things and understand them. When you read your first one it's quite boring, but after a while you get the gist of it. The other aspect, and this is going to sound absolutely insane, is I look at a job spec and ask: does this look like three jobs?
00:37:22
Speaker
If it does, then they're probably really busy and I'll probably get to do a lot. I respect that a lot of people want a job that's quite clearly defined; that's just never been my thing personally. Same with my job title: I don't think it's an accurate reflection of what I do here. The other aspect, alongside where the company is going financially and what the job spec looks like, and probably the most important thing, is what vibe the people I'm speaking to give off. Because I've interviewed at
00:37:49
Speaker
places where I thought it would be my dream company, companies I thought would be awesome just from what they've done in the space, and the interviewers were absolute arseholes. And I thought, while you are a big tech company, or you are a very big player, I'm going to have to work with you all the time, and I'm not willing to take disrespect from you now, in an interview. And that's a big thing I do in my interviewing with anyone.
00:38:12
Speaker
This person could be the most horrible person in the interview, but I will literally treat them with the best customer service, because I still want them to be a customer of ours in the future. So that's a really big thing. Is the company doing well? If not, are they willing to invest? Does the job spec sound fun and give a rough direction? And finally, if you do get to interview stage,
00:38:35
Speaker
are the people arseholes or not? Yeah, no, I think they're three great points. You know, the culture and who you're dealing with on a day-to-day basis is so important. And yeah, you want to be in a challenging role which is pushing you, where you're building bigger, better things. So completely agree. And that
00:38:51
Speaker
moves us on nicely to: what's your best piece of advice for people in an interview? That's a good question, and someone asked me this the other day. I think the first one is don't seem desperate, because that can come off negatively, and it might be strange to say because it's an interview.
00:39:08
Speaker
But try to find that moment where you can build a human connection with the other person. Because most interviews I've either been the hiring manager of or the candidate of, it's kind of like a back and forth of question, answer, question, answer. And then the candidate asks no questions. So you don't really know if they're interested in the role.
00:39:26
Speaker
Don't worry too much about the hiring manager's interview structure, because they do that to make it as fair as possible. But if you find moments where you can tell the hiring manager is really interested in a technology, or a particular area of strategy, you should just talk about that, because then
00:39:43
Speaker
you're showing that person you're not an order taker, that you can actually be a trusted advisor, and that you'd work together quite nicely. Yeah, I really like that. So it's just about being real, showing your passion when need be. And yeah, I think many people aren't looking for a yes-man; they're looking for someone who can drive solutions within the team. So great advice. And finally, if you could recommend one resource for the audience to help them upskill, what would it be?
00:40:08
Speaker
Udemy, honestly, Udemy. Because, as I said earlier, I studied psychology, and when I got into this field I felt a lot of imposter syndrome about the fact that I'd just done Udemy courses and learned programming, data, well, machine learning from there. But then at HSBC I went and did a master's in computer science, and I'd never been so bored in my life,
00:40:30
Speaker
and I dropped out. And then at Penguin I went and did another master's in computer science, at Oxford, just to feel like I belonged in this industry. And I won't lie, both of them were a waste of my time. I stuck it out at Oxford because it's a very good brand to have on my CV, but Udemy is the best resource, and there's a lot of free stuff on there.
00:40:48
Speaker
And if you can use it to build a portfolio somewhere, whether it's on GitHub or Kaggle, that'll really help you get a job.

Podcast Conclusion

00:40:56
Speaker
Perfect. I think with Udemy, you have a lot of practitioners who are building out real solutions to real-world problems, whereas sometimes in the classroom it's not quite what it seems. I say this now because I've just finished: a lot of the stuff I learned at Oxford,
00:41:09
Speaker
one of the best computer science schools in the world, was seven years out of date. I was literally learning about stuff that we were migrating off and deprecating. But yeah, YouTube, Udemy. I mean, I learned Java, SQL, Python, deep learning, all of this stuff off Udemy, and it's done me well. Amazing. Well, Sami, that's the end of the show. Really appreciate your time. It's been great speaking. And yeah, I'm sure the audience are going to be really happy to hear your insights from Penguin. Thank you. Thank you, Harry.
00:41:39
Speaker
Well, that's it for this week. Thank you so, so much for tuning in. I really hope you've learned something. I know I have. The Stacked podcast aims to share real journeys and lessons that empower you and the entire community. Together, we aim to unlock new perspectives and overcome challenges in the ever-evolving landscape of modern data.
00:42:00
Speaker
Today's episode was brought to you by Cognify, the recruitment partner for modern data teams. If you've enjoyed today's episode, hit that follow button to stay updated with our latest releases. More importantly, if you believe this episode could benefit someone you know, please share it with them. We're always on the lookout for new guests who have inspiring stories and valuable lessons to share with our community.
00:42:22
Speaker
If you or someone you know fits that bill, please don't hesitate to reach out. I've been Harry Gollop from Cognify, your host and guide on this data-driven journey. Until next time, over and out.