Introduction to Developer Voices and Joe Reis
00:00:00
Speaker
Today on Developer Voices, we're talking to one of the authors of the O'Reilly book, Fundamentals of Data Engineering. It's a nice lofty title, isn't it?
Defining Data Engineering
00:00:10
Speaker
You might ask yourself, what's data engineering? Isn't all engineering involving data? Come on, give me a definition. And yeah, that's basically where I begin the conversation with Joe. But this is one of those conversations that really took me by surprise.
00:00:26
Speaker
I thought we'd be talking about data serialization formats, Apache Flink versus Apache Spark, the necessary evils of a bunch of Python scripts that munch things. But where we actually ended up was way more fundamental than that.
00:00:42
Speaker
What are we moving data around for? What's the value of engineering data to the end user? What responsibilities do we have to the business?
Technology and Business Goals
00:00:53
Speaker
This is very much a conversation where technology exists and data exists as a means of serving some larger goal. Here on this podcast, we love to think about how we're getting to the future. This is a good reminder that the how begins with the why.
00:01:11
Speaker
What and who are we doing all this data engineering for? In the end, I suppose this week's episode is a story about people using the business and within the business, told through the lens of data. So let's get started. I'm your host, Kris Jenkins. This is Developer Voices, and today's voice is author Joe Reis.
00:01:45
Speaker
My guest today is Joe Reis. Joe, how are you doing? Hey, good. How are you doing? I'm very well. Very well. Excited to talk to you about your newly released book, which has a great title, a weighty, meaty title, Fundamentals of Data Engineering.
Motivation and Origins of Data Engineering
00:02:03
Speaker
Yeah, we put the fun in fundamentals. So that's like
00:02:10
Speaker
I mean, there are two sides to that, right? You've got fundamentals: anyone who says they can write a book on fundamentals means business. And on the other side you've got data engineering, which feels like, do we need a new term for that? I'm going to ask: why do we need the term data engineering? Haven't we been doing that since the sixties?
00:02:32
Speaker
I mean, you can make a strong argument that we have, right? Everything we've been doing with computers has data in it, and I suppose there's an engineering element to it. But I didn't see the term in the sixties, and I didn't make up the term data engineering either. I just think it's a field where we're really good at coming up with new ways of describing things that are old, with maybe a few new things tacked on. Data engineering is one of them.
00:03:01
Speaker
I think I first saw the term popping up maybe around 2015, 2016. And then again, we can get more into some of the backstory that I've recently heard about the origins of the term data engineering.
Unique Aspects of Joe's Book
00:03:11
Speaker
But the attempt really of this book was to describe a field that I feel like I'd implicitly been doing for a while, and many other people had. So you're trying to delineate what this thing we seem to be doing anyway actually is?
00:03:29
Speaker
Yeah, yeah, that's about it. I mean, it's no different than data science. I mean, that came out in what, 2009 as a term? Yeah. That was some fine folks from LinkedIn, I believe. And, you know, I still to this day have literally no clue what data science means, even though we use the term every day. But I think
00:03:48
Speaker
But the field has been well described, right? Even if you can't come up with a consistent definition for, say, data science, you still kind of implicitly know what it is. Same thing with data engineering. I felt like there were a lot of books prior to mine about data engineering:
00:04:03
Speaker
data engineering in AWS, data engineering with Spark and so forth. But what we hadn't seen, me and my co-author hadn't seen was people taking a step back and really trying to describe data engineering from first principles.
00:04:19
Speaker
Not tied to a particular technology, but in general. Correct. Yeah, which to us was a harder thought experiment. If you were to massage data engineering and peel apart what's there, what are the immutables? What are the things that aren't likely to change over the next five to 10 years? I think that was a book that Matt Housley, my co-author, and I were interested in writing. We certainly were not interested in writing
00:04:48
Speaker
data engineering with Python or something like that. Not to say those aren't great books and great ideas, but that just wasn't what we were interested in doing. So you want the more abstract stuff that's going to maybe even outlive our careers. I mean, that was one of the goals. You know, Kevin, who was actually one of the people who wrote a praise quote, said that this book will, you know, live on for decades. And maybe it will. Maybe it won't. I have no idea. But
00:05:15
Speaker
O'Reilly too. We had approached them with the idea for this book, and they actually tried to convince us not to do it. They said this is going to be very challenging. Yeah. I mean, you know, two first-time authors who have the hubris to try and describe an entire field. I mean, that's the kind of book that I think scares publishers. Yeah. Ambitious. Yeah. Trying to make concrete those kinds of abstract terms. You can see people getting lost in the weeds for a long time.
00:05:43
Speaker
Oh, yeah. So, but we, you know, Matt and I, we were both too smart for our own good. And I would say too dumb for our own good too. So we figured this was, you know, something we needed to do.
What is Data Engineering?
00:05:59
Speaker
Which was challenging. We were running a business at the same time, I was in the middle of remodeling a new house, and Lord knows what other projects we had. So writing a book in the middle of that was maybe not the wisest endeavor, but it was the path that we chose, or I would say the path chose us. And you survived to tell the tale and print the cover.
00:06:22
Speaker
Yeah, did that. Okay. So we should get into what you discovered, distilled, invented. No, delineated. That's the word I'm looking for. I don't think I invented anything. Maybe we get, maybe we provided a different way of looking at the field. Maybe that was, I don't know if that's an invention. That's more just like, here's a lens. Yeah, lens. Yeah. Yeah. That's a very good way to put it. So.
00:06:48
Speaker
Looking through your lens of what data engineering is, can you define it for me as a field? Yeah, I mean, the TL;DR (too long, didn't read) of data engineering is this: there are systems that sit upstream from you that generate data. These might be relational databases,
00:07:07
Speaker
NoSQL databases, APIs, or whatever. Often these systems don't treat, quote, "data" as a first-class citizen. A data engineer will take that data and
00:07:24
Speaker
do something to it to make it useful for downstream consumption, for use cases like machine learning, analytics, reverse ETL, or whatever. That's, in a nutshell, the role of a data engineer. A data engineer is responsible for what we call the data engineering lifecycle as well as its undercurrents.
00:07:43
Speaker
So again, looking at systems upstream, ingesting that data, transforming it, serving it all along the way, storing it in various ways, shapes, and forms. And then you're also responsible for the undercurrents. So that's like security, data management, architecture, ops, orchestration, software engineering, et cetera. So in a nutshell, that's what we felt really encapsulated the field of data engineering.
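As a rough illustration of the lifecycle just described (generation upstream, then ingestion, transformation, and serving, with storage throughout), here is a minimal in-memory sketch. The `Pipeline` class, the `upstream_orders` source, and the field names are all hypothetical; nothing here comes from the book, and the undercurrents (security, orchestration, and so on) are left out entirely.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class Pipeline:
    """Hypothetical in-memory stand-in for the lifecycle stages."""

    # Storage runs through every stage; a dict of named tables plays that role here.
    storage: dict[str, list[dict]] = field(default_factory=dict)

    def ingest(self, source: Iterable[dict]) -> None:
        # Ingestion: pull records from an upstream system we do not control.
        self.storage["raw"] = list(source)

    def transform(self, fn: Callable[[dict], dict]) -> None:
        # Transformation: reshape raw records for a downstream use case.
        self.storage["curated"] = [fn(record) for record in self.storage["raw"]]

    def serve(self) -> list[dict]:
        # Serving: hand curated data to analytics, ML, or reverse ETL consumers.
        return self.storage["curated"]


def upstream_orders() -> Iterable[dict]:
    # Generation: pretend this is an application database or an API (made-up data).
    yield {"order_id": 1, "amount_cents": 1250}
    yield {"order_id": 2, "amount_cents": 990}


pipeline = Pipeline()
pipeline.ingest(upstream_orders())
pipeline.transform(
    lambda r: {"order_id": r["order_id"], "amount": r["amount_cents"] / 100}
)
served = pipeline.serve()
print(served)
```

The raw copy is kept alongside the curated one on purpose: storing data in more than one shape along the way is part of the lifecycle as described.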
00:08:09
Speaker
And that's pretty big. I mean, by the time you're doing one thing, by the time you're doing anything plus security and large scale storage, that's already enough to be quite a lot for people to learn as a career. And we don't expect people to, you know, to...
00:08:25
Speaker
become experts in all these things, but we feel like these are the things that you at least need to be aware of as a data engineer and be cognizant of. Depending on where you work, depending on the size and maturity of the company you're at, maybe you'll be handling all of this, maybe you'll be handling a small subset of it.
00:08:42
Speaker
To be aware of the holistic life cycle, I think it's incumbent upon data engineers because without that sense of how data flows and the various things that happen to data along its life cycle, I don't think you really understand the full extent of how you impact data as a data engineer or otherwise.
00:09:09
Speaker
Yeah, yeah. In a way, your very job, I think, is to connect between separate systems, right? Between sources of data and the people trying to analyze it. And if your job is building bridges and connections, you have to be aware of the larger picture.
Challenges in Data Engineering
00:09:25
Speaker
Yeah, the bigger picture is the one thing that I kept noticing, and same with Matt Housley, my co-author, we kept noticing engineers are very myopically focused on their task at hand or their specific silo. But through our consulting practice and through our experiences working at nine to five jobs, we noticed that mentality and that perspective.
00:09:47
Speaker
I think it really short-changed data teams. If you aren't holistic, then you really don't appreciate the impact that you have. Understanding the impact you have means you can have a bigger impact in your job, and a healthier one.
00:10:10
Speaker
But how do you do that? If you're trying to pull together different systems as a whole, and I'm thinking large organizations like banks are a good example, you're trying to build very tricky bridges between different things. It feels like it'd be very easy for that to fall into always being ad hoc stuff, building rope bridges between places that were never designed to be connected.
00:10:40
Speaker
bridges or like a, or like those Tarzan rope swings between them. Yeah. I mean, it's sort of like falling off the rope and crashing. Totally. Yeah. And kind of alligators or something. Um, I mean that happens often, right? And so I think that's, that this is exactly the result of not having, um,
00:10:58
Speaker
kind of zooming out and having the bigger perspective. This is precisely what happens. And we see this all the time. Most architectures, especially as you get bigger companies, they're more like Rube Goldberg machines than they are something that has any coherent sense. And that's a symptom. That's not a cause. That's a symptom of practices.
00:11:23
Speaker
maybe generations of various departments or fiefdoms or kingdoms of these organizations building things according to their will and so forth and not recognizing maybe the larger domain in which they operate and maybe the larger enterprise.
00:11:39
Speaker
So again, you know, data engineers, if you kind of zoom into what specifically we think data engineers should do, then it's, you know, the lifecycle management that we described earlier.
Persuasion and Lifecycle in Data Engineering
00:11:49
Speaker
That's, you know, again one lens through which to look at it. I can't say that we've completely described the field, but to the best of Matt's and my abilities, I think we did so. Right. Yeah, there might be some pushback, but it's fine.
00:12:06
Speaker
It sounds like part of the job is always going to be persuasion as well, like the social aspect of data engineering. Yeah, that's the same in any field, right? I mean, software engineering is persuasion and accounting is persuasion and so forth. But persuasion is just, I mean, yeah, there are two angles. This is something I harp on a lot in my podcast and writings too: persuading
00:12:34
Speaker
And selling is just one of these things where I think it's a very underrated skill, which we can talk about if you'd like. But I think that's the yin to the yang of being technically proficient. It's great to be technically proficient, but then being able to translate that into systems or practices that people believe in and can buy into, I would argue that's perhaps the harder skill to learn and to practice in an organization.
00:13:03
Speaker
Yeah, I definitely think as an engineer, you don't have to sell things for money, but you do often have to sell ideas. I mean, well, you are selling it for money in a sense where your paycheck does. Yeah, but you're not trying to get someone to sign a contract and give you money so much as you're trying to persuade.
00:13:23
Speaker
Yeah, exactly. Even as an engineer, you're trying to persuade other people on your team. This is sometimes the hardest part, especially when you have a lot of really smart, opinionated people, which engineering teams tend to be, vying for ideas. I'm sure you've sat in on meetings with dev teams,
00:13:44
Speaker
analytical teams, and so forth. These conversations are fun. They can also be brutal. Persuasion is definitely something I think people should get stronger at. But it's also about persuading in a healthy manner. What I see often is, when you get too many smart people in a room with opinions, the risk of bikeshedding, if you know what that term means. That's a very real thing.
00:14:07
Speaker
That's the one thing where you've got to have the situational awareness to know when you're entering the land of extreme navel-gazing and pointless conversations. It sneaks up on you. Yeah, definitely. Okay, so speaking of situational awareness: maps. You've outlined this data lifecycle as a way of looking at the map of the landscape
00:14:33
Speaker
of data engineering. So take me through your idea of the data life cycle. Data life cycles really, if you kind of zoom out to its entirety, data life cycles are very fractal.
00:14:49
Speaker
And they have feedback loops into each other. So if you take the general data lifecycle, data is created and then things happen to it along the way; it morphs into different shapes and forms. And it has an interesting value stream map, almost. Data isn't just created willy-nilly. There was an intention behind its creation, an intention behind its usage, and maybe an intention in terms of when it's archived or when it goes away to die.
00:15:22
Speaker
It goes through different domains, different processes, different modalities and so forth, different data models and all this stuff. The data engineering life cycle really encapsulates, I would say, the boundaries as they exist right now. Again, I'm open to changing this thought process as practices evolve in the future, but as it stands right now,
00:15:46
Speaker
A lot of data that's – okay, so data as an entity versus data as a department, distinguish between those because the term's overloaded. But data as a thing maybe that's used in applications, right?
00:16:10
Speaker
in various ways and shapes and forms. But there's also this other thing, you know, there's this whole practice of quote, data, which involves maybe data science, you know, whatever that is, we can talk about that, analytics, machine learning, and many other use cases.
00:16:28
Speaker
that apply themselves in non-application uses,
Building Strong Data Foundations
00:16:33
Speaker
right? So then there's this sort of a divide right now. And again, we can get into why I think this divide's maybe artificial or maybe needs to go away. But as it stands right now, the role of a data engineer is basically the sort of the middle layer between applications and the use cases of quote data.
00:16:49
Speaker
And why did this come into being, right? This life cycle, which we talked about. Well, it was implicitly something already done by data scientists. I mean, there's this old trope about data scientists spending 80 to 90% of their time getting data, cleaning it, doing stuff with data, stuff they weren't trained to do, stuff that they hate doing, but they do it anyway. That's the old trope: they have to do this because it's the prerequisite to getting the actual job they want to do done.
00:17:13
Speaker
Right. So how would you feel if 10 to 20% of your job was spent doing the thing that you're actually trained to do, and the other 80 to 90% of it was spent doing all the stuff that you weren't trained to do and kind of loathe? That's the reality of data science. I mean, my title on LinkedIn is "recovering data scientist."
00:17:30
Speaker
A lot of that was born through just direct experience of having done this. At some point, I was like, I need to build the systems and the workflows for me to be able to do my job as a data scientist. I wasn't the only one. This kept happening over and over. I met my coauthor and business partner, Matt Housley. He had a lot of the same experiences. I noticed data engineering started coming into the fray.
00:18:01
Speaker
I think it's sort of this recognition that, okay, so software engineers
00:18:06
Speaker
They aren't going to manage this life cycle of getting data from applications, transforming it for use cases, storing it, and serving it. Maybe that's a data scientist, but maybe there's this emergence of a new title, I suppose, data engineer, that just came out, I think, out of necessity to serve the use cases of the data scientists. Because these data scientists, these poor people, for the most part,
00:18:34
Speaker
I think a lot of them like myself just went into data engineering out of practicality because it's like, unless you build a foundation of getting the data to be able to do something useful with it, you have no data upon which to do the science, right? So then you're. Okay. So make that concrete for me. How do you build that kind of foundation?
00:18:55
Speaker
Yeah, it's, well, it's just that easy and just that hard, right? So it's like, you got to understand what your use cases are. You got to understand what your requirements are. What kind of latencies do you need? So when a data scientist is serving the end customer, what's needed, right? It's not necessarily the data scientist that's the end customer. The end customer is the person that's using that data for, you know,
00:19:21
Speaker
to make decisions, it might be a machine learning model that's serving billions of users in an application. We'll talk about feedback loops in a bit. It goes back to that. So the end customer is both human and machine at this point. It's not just a person. But it's the end use case. In the end use case, again, if you want to take it through another step, the end customer is the end customer. So it's the user of an application with that machine learning model.
00:19:50
Speaker
It's a person who's impacted by the decision that's made by the person looking at the report and decisions made from that. And so you work backwards from those use cases and then you figure out what that foundation needs to look like, right? But I think there's certain immutables.
00:20:06
Speaker
in the foundation, you're going to need to ingest data from sources, you're going to need to store it in some mechanism, right? Or whether that's a very temporary storage, you know, in the case of streaming data, that data is stored as a log somewhere, or maybe it's a long term archival, you know, or maybe it's available in a data warehouse or a
00:20:26
Speaker
lake house or whatever, but then you gotta have different ways of integrating and transforming data for those use cases. Maybe it's a dimensional model for a report, maybe it's one big table for a report, maybe it's making data useful for feature engineering for a model, machine learning model that is. So there's a whole other universe. And then of course there's the machine learning engineering universe, which we won't talk about, but that's a different parallel universe.
00:20:54
Speaker
But I always look at things through the end customer. And the external customer typically, at the end of the day, how is this impacting somebody? What's the use case? Then you work backwards and you figure out what that is. But the life cycle of data engineering, as far as Matt and I could tell, it doesn't disappear.
Data Integration and Modeling
00:21:14
Speaker
We tried to beat the crap out of this idea. It's like, okay, so under what situation would this thing go away? And we're like, I don't know as it stands right now unless
00:21:24
Speaker
The only time I see this shrinking really or morphing is when software engineering and more to the point like application data sort of become a lot more integrated at which point maybe it does, which we can also talk about how that happens, but... Well, yeah, this is... So going back to the idea that you have that data is written with intention, it sounds like you're saying, okay, so the application writes some data with a particular intent.
00:21:54
Speaker
someone over there wants that raw data because they have a completely different intent. And somehow it's the job of the data engineer to connect intent. And I'm wondering if you say that the data engineer has to be aware of the final intent, but maybe they don't. Maybe it's their job to get data written with one intent
00:22:36
Speaker
and make it generically available in an unopinionated way.
00:22:47
Speaker
Generic data, I mean, it could work, I think, depending on your use case. Maybe, maybe not. I suppose it depends on what you define as generic data. One thing I've been thinking a lot about, and it's not a new idea, but my friend, John Giles, so he's kind of one of these people who I put in the category of just like this Yoda type person. Is Yoda a person? I don't know. But anyway, Yoda. Yoda, right. Yeah, human rights.
00:23:04
Speaker
I put him in that category where he's really thinking about this. So I'm writing a new book on data modeling right now, and I've really been thinking about data modeling from first principles. One of the things he describes in his books, The Nimble Elephant for example, is the use of patterns.
00:23:20
Speaker
So data model patterns, right? So back in the 90s, actually, people were coming up with these generic data model patterns that could be used across industries, right? So maybe you're manufacturing, here's a manufacturing model pattern that you should use. And really, you know, as you think about this, there are higher level generic
00:23:44
Speaker
modeling patterns you can use. For example, the party and the role is a classic example, right? So there's a party, this might be a customer, this might be a dog, but they might have a role, right, and so forth. And so these higher level abstractions certainly could be used, I think, to come up with, quote, generic data. Or at least I would say more flexible patterns of using data.
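To make the party-and-role idea a little more concrete, here is a minimal sketch. It assumes one very simple reading of the pattern: a generic `Party` record that can hold any number of roles. The class, field names, and sample data are hypothetical and are not taken from The Nimble Elephant or any other published model.

```python
from dataclasses import dataclass, field


@dataclass
class Party:
    """Any actor the business deals with: a person, a company, even a dog."""

    party_id: int
    name: str
    # Roles are attached to the party instead of being baked into its type.
    roles: set[str] = field(default_factory=set)

    def add_role(self, role: str) -> None:
        self.roles.add(role)

    def has_role(self, role: str) -> bool:
        return role in self.roles


# Hypothetical data: one party can play several roles at once.
acme = Party(party_id=1, name="Acme Ltd")
acme.add_role("customer")
acme.add_role("supplier")

rex = Party(party_id=2, name="Rex")
rex.add_role("patient")

customers = [p.name for p in (acme, rex) if p.has_role("customer")]
print(customers)
```

The point of the abstraction is that "customer" stops being a table of its own and becomes one role among many, so a new kind of relationship does not force a new entity.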
00:24:09
Speaker
That's what, in the early days, Codd was trying to get at with relational data, right? Yeah, data that was completely separate from the way you might use it.
00:24:21
Speaker
Yeah, exactly. Because back in the day you had IMS and other databases with different hierarchical or network approaches, for example. His relational model was really asking: okay, if I take set theory and try to apply that to data, what does this look like? The relational model is definitely a wonderful
00:24:44
Speaker
It's not even a physical abstraction really. It's a logical abstraction of the data and it's somewhat conceptual, but more logical. And yeah, so I mean, people have definitely taken that. But what I find interesting is, you know, if you ask application developers, okay, so show me the relational model you're using.
00:25:04
Speaker
What normal form is your database in? I've asked this of a lot of developers. I would say half of them don't even know what that means right now, or maybe studied it for a second in a CS or IS class. These abstractions are, I think,
00:25:24
Speaker
So it's interesting because I'm trying to reconcile the notion of, OK, so data from this ivory tower standpoint where we view it as we can talk about the generic abstractions of data and the methodologies of getting there. But then you try and balance that against the real world implementation of it at a physical level. And it gets very fascinating because this data traverses through the lifecycle. Applications, for example,
00:25:52
Speaker
If I'm an app developer, hey, maybe I'll just use, I don't know, MongoDB and just store everything as nested collections and documents in that collection. Then I've seen some crazy stuff. I saw people using dates as keys in Mongo.
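One plausible reason dates-as-keys causes trouble, sketched with plain Python dicts standing in for documents. No MongoDB driver is used, the document shapes and field names are made up, and this is not necessarily what went wrong in the case described. When the date lives in the key, a range lookup has to inspect and parse every field name; when it lives in a value, the filter targets one named field.

```python
from datetime import date

# Hypothetical shape 1: each date is a key (a field name) in one big document.
readings_by_date_key = {
    "sensor": "s-1",
    "2023-01-01": 20.5,
    "2023-01-02": 21.0,
    "2023-01-03": 19.5,
}

# Hypothetical shape 2: one document per reading, with the date stored as a value.
readings_by_value = [
    {"sensor": "s-1", "day": date(2023, 1, 1), "temp_c": 20.5},
    {"sensor": "s-1", "day": date(2023, 1, 2), "temp_c": 21.0},
    {"sensor": "s-1", "day": date(2023, 1, 3), "temp_c": 19.5},
]


def range_from_keys(doc, start, end):
    # Every key has to be inspected and parsed; non-date keys are skipped by hand.
    return [
        value
        for key, value in doc.items()
        if key != "sensor" and start <= date.fromisoformat(key) <= end
    ]


def range_from_values(docs, start, end):
    # The filter targets a single, stable field name.
    return [d["temp_c"] for d in docs if start <= d["day"] <= end]


start, end = date(2023, 1, 2), date(2023, 1, 3)
from_keys = range_from_keys(readings_by_date_key, start, end)
from_values = range_from_values(readings_by_value, start, end)
print(from_keys, from_values)
```

Both functions return the same readings here; the difference is that the first shape grows a new field name every day, while the second keeps a fixed schema that a query can address directly.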
00:26:10
Speaker
Yeah, exactly. That seems very troubling for most use cases. Yeah, yeah. That happened, and it was troubling. They called us in. They're like, this thing doesn't work. I'm like, no. I'll show you why. You might not like the answer, but. But there's a flexibility, right? So I think because tools have gotten very easy to use, what it means is, and people maybe haven't learned the when or why of
00:26:38
Speaker
you know, maybe data modeling, for example, or other practices, now it's like, okay, I'm just going to throw data into this easy-to-use thing. And then things happen. But then you start multiplying this across every time data moves across its lifecycle, maybe for a different purpose now.
00:26:57
Speaker
You know, every tool is easy to use these days, I would say overwhelmingly. It's not like it was 10 years ago. Everything's off the shelf for the most part. It's super easy to get into. I always equate it to a bear trap: easy to get into, very hard to get out of, and painful. You might lose a limb trying to get out of this thing. And this is very much the way it is right now with data. So to bring it back to the life cycle, right? It's kind of like,
00:27:21
Speaker
If you're aware of just the various stages that data moves and your areas of responsibility, hopefully that at least gives you a framework and a lens through which to view maybe, how do you approach practices?
00:27:37
Speaker
higher level, maybe a more intelligent level. Because if you approach it from a very myopic view, I have to figure out the cheapest way to finish my ticket for this sprint in a way that may cause insane collateral damage, but whatever. That's fine.
00:27:57
Speaker
blow up the village if I have to to get this sprint done, then that's what it is. But that's the approach I think a lot of people take now. It's sort of, what's the task in front of me and how do I do this as quickly as possible? And so the notion of the life cycle is really to give you the context of the various things you should at least consider.
Data Transformation and Serving
00:28:14
Speaker
And if you're aware of the life cycle and you still want to ignore it, I suppose it's your prerogative and you get to deal with what happens.
00:28:22
Speaker
Okay, so let's think more about that then. So your life cycle is, I've got the diagram just in front of me, I'm going to read it off. So you've got generation, ingestion, transformation, serving to the final use case. Let's
00:28:44
Speaker
I think it's interesting that, I mean, there's a lot that's interesting about that, and I guess you spend a whole section of the book unpacking it, right? But ingestion is always harder than you think it's going to be. Transformation doesn't seem necessarily a part of a data engineer's job, aside from changing data formats. And serving is just a whole world of, okay, technology choices. Can you give me some concrete guidance on this?
00:29:15
Speaker
Ingestion's always hard because you're relying upon data that you typically have no control over, right? Yeah. So you're on the receiving end of it, right? So the data's in whatever format it happens to be in, it's in whatever storage mechanism it happens to be in, and you get to try and ingest that given a whole body of constraints. That's super fun. So you can do push or pull.
00:29:37
Speaker
Um, you know, batch or streaming, you know, the, the world's your oyster, uh, have at it. So that's, that's, as you point out, the very, very hard part there. Um, transformation is an interesting one. I feel like that, um, you know, is.
00:29:55
Speaker
It is a quasi-optional section in some ways, right? In the sense where maybe you don't need to transform the data. Maybe you just ingest it and then serve it, right? In which case, why not just go to the transactional system or the source system and just get the data yourself, right? You can do a read replica from a database and just do it that way. But sometimes you want to store that data in a data warehouse, maybe in its raw form, and you're perfectly fine doing that.
00:30:22
Speaker
So somebody asked me, my friend Larry Burns actually asked me, he's a really big data modeler too, one of the people I put in the Yoda category. I think Bill Inmon actually asked me this too. He was the guy who created the data warehouse. So these people were like, why aren't you talking about integration? And I was like, well, we're about to talk about integrating data in the context of transformation, right? But then what happens when you have a stream that you haven't integrated any other data with?
00:30:50
Speaker
Then integration is not the same as transformation. Integration could just mean I'm combining different data sources without a transformation step. That's the delineation there. With things like streaming, maybe you don't need to. The stream is a stream. I think the transformation is where it gets a bit interesting because that's where
00:31:15
Speaker
the data starts getting value arguably, right? So if data from the source system already has value, I would assume that the data maybe accidentally or purposefully had a data intention in mind, not just the application intention. Then you get bonus points for skipping over that one. Congratulations, good job. You're working really well with your application team or the people providing the data and I think that's pretty awesome.
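The distinction drawn above between integration and transformation can be sketched with two small made-up sources. Here integration just combines orders with customer names and leaves every value as it was, while transformation reshapes the result into something new, a total per customer. The data and names are hypothetical, and this is only one way to draw the line.

```python
from collections import defaultdict

# Two hypothetical upstream sources.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 12.5},
    {"order_id": 2, "customer_id": 11, "amount": 9.9},
    {"order_id": 3, "customer_id": 10, "amount": 30.0},
]
customer_names = {10: "Ada", 11: "Grace"}

# Integration: combine the sources; the original values are left unchanged.
integrated = [
    {**order, "customer_name": customer_names[order["customer_id"]]}
    for order in orders
]

# Transformation: reshape the data for an analytical purpose (total per customer).
totals = defaultdict(float)
for row in integrated:
    totals[row["customer_name"]] += row["amount"]

print(integrated[0])
print(dict(totals))
```

If the integrated rows are already what the consumer needs, the second step can be skipped, which is the "bonus points" case described above.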
00:31:43
Speaker
But typically, transformation is where it gets interesting. Because you are having to either transform a data set, a raw data set into something that's amenable to analytics, say, we'll use that as an example. So aggregations and different ways of molding the data for an analytical purpose, that's often tricky, right? Yeah. Because again, the data from these source systems often doesn't have a consideration of analytics in mind.
00:32:13
Speaker
You know, or if you were to take it a step further and transform data into something that's, say, a dimensional model, like what Ralph Kimball came up with for the star schema, that's a very well established, well trodden way of providing analytics, you know, in a way that
00:32:31
Speaker
preserves business logic, business rules, and so forth, and preserves those concepts that ultimately a business user, when they're using the data, can access it and understand it. And so again, it depends on your end customer. That's where transformation comes in. And I feel like that's one of the chapters where I was actually least satisfied. And that's actually why I'm writing a book on data modeling, because I felt like the whole notion of data modeling is just
00:33:00
Speaker
It needs to be, I think, re-exposed to a new generation of engineers, whether they're software engineers, data engineers, data scientists, whatever. Yeah, it feels like in jumping from the single machine relational era to the petabyte multiple servers, we need new kinds of databases era, we forgot a lot about data modeling.
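For readers who have not met the dimensional model mentioned a moment ago, here is a toy star schema, assuming one fact table of sales that points at a product dimension and a date dimension. The table names, keys, and figures are invented for illustration; a real warehouse would hold these as tables and express the rollup in SQL.

```python
from collections import defaultdict

# Dimension tables (hypothetical): descriptive context, keyed by surrogate keys.
dim_product = {
    1: {"name": "Widget", "category": "Hardware"},
    2: {"name": "Gadget", "category": "Hardware"},
    3: {"name": "Manual", "category": "Books"},
}
dim_date = {
    20230101: {"month": "2023-01"},
    20230201: {"month": "2023-02"},
}

# Fact table (hypothetical): one row per sale, holding keys and numeric measures.
fact_sales = [
    {"date_key": 20230101, "product_key": 1, "quantity": 2, "revenue": 20.0},
    {"date_key": 20230101, "product_key": 3, "quantity": 1, "revenue": 15.0},
    {"date_key": 20230201, "product_key": 2, "quantity": 1, "revenue": 30.0},
    {"date_key": 20230201, "product_key": 1, "quantity": 3, "revenue": 30.0},
]

# A typical analytical question: revenue by month and product category.
revenue = defaultdict(float)
for sale in fact_sales:
    month = dim_date[sale["date_key"]]["month"]
    category = dim_product[sale["product_key"]]["category"]
    revenue[(month, category)] += sale["revenue"]

print(dict(revenue))
```

The business terms ("category", "month") live in the dimensions, so a business user can slice the facts using words they already know.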
00:33:28
Speaker
Oh, yeah. Yeah. And it's at a higher level too, right? It's high level modeling. It's not even like how you physically implement it, but it's like, how do you translate business concepts, you know, in the way data is used in the enterprise? And the enterprise is a very enterprise-y term, but I'll use it. But okay, so, you know, and again, the way I'm seeing it is it's, you know, whether you're drawing on domain-driven design,
00:33:55
Speaker
as Eric Evans described it back in the day, whether you're talking about high-level conceptual and logical modeling across the enterprise, John Giles calls it enterprise data modeling and enterprise logical model. But translating these concepts in a way that these can be reused across different aspects of the data lifecycle. And so there's no confusion, right? Right now, it's like when you talk about a customer, for example, or a product, these are two classic examples of things that get lost in translation. And by the time
00:34:24
Speaker
It goes from an application where the engineers have one idea of a customer all the way to data scientists who may have a different idea of a customer. I think it just behooves us as anyone who touches data to really take a step back and just have a unified way of looking at these business rules, business concepts, definitions, and so forth. But again, easier said than done. We can pontificate all day.
00:34:49
Speaker
But I think this is one of the central struggles of where the data field is right now is we actually don't have, we tend to operate very much in our own kind of a, you know, like I said earlier, the life cycle is very fractal. And if you kind of zoom in on where people practice, right, so different, different, different intents, for example, become different silos.
Future of Data Engineering Practices
00:35:13
Speaker
And so that's, you know, one thing I've just been,
00:35:16
Speaker
thinking a lot about, but as it pertains to the data engineering lifecycle, you know, kind of bring it back to that, like transformations is one of these things where if you can get the data right upstream, right, the generation and the ingestion part, multiple generation part, everything is easier. You may not even have to transform data, but you're transforming mainly because that's
00:35:37
Speaker
You know, just how are you going to translate it into various other intents down the road, right? And then serving, you bring up serving like that. That's a gobstopper. I mean, jeez, any number of ways of serving. Well, before we get into that, if I can step you back just a second. Yeah, please do. That makes me wonder, like this idea of pushing the responsibility for data upstream.
00:35:57
Speaker
Are you an advocate of, like, the whole data mesh, data as a product, data should be a primary concern within each department
00:36:07
Speaker
idea. Yeah, I am. I'm a fan of, I think, the notion of data mesh. I mean, the book's sitting over there, and Zhamak is really... The font size is slightly too small for me. Oh, yeah, yeah, yeah, no worries. Yeah, and Zhamak is a really good friend of mine. So, you know, Zhamak Dehghani, the one who came up with data mesh. And so, you know, we talk often. And I think that
00:36:29
Speaker
To me, it represents an ideal that I personally agree with. I know some people don't, but I think that right now, and it's ironic because I wrote a book called Fundamentals of Data Engineering, but I feel like the distinctions between
00:36:48
Speaker
upstream and downstream consumers' data. I think this is largely reflective of sort of the struggles we've had in IT for the past few decades where we have systems that make data and then we have to do analytics and all this stuff. But I feel like, especially as we move towards data applications becoming more normalized, what I mean by that is
00:37:16
Speaker
There is no separation between an application and the data, if you know what I mean. Machine learning powered applications like Uber, for example. I don't think there's really a ... From a user standpoint, it's not like I'm sitting there like,
00:37:33
Speaker
my other part where I'm going to access the data warehouse for my reports and stuff like that. It's just everything moves in a very harmonious way between application and data and back to application again. That's the feedback loop that I, even in the last chapter of my book, I write about where I think that's hopefully the norm, but that also means data mesh becomes a practicality at that point.
00:37:55
Speaker
But, you know, I mean, there's there's a lot of I think there's a lot of interest in this notion. But when you kind of zoom out and look at this, the landscape of companies where this is going to be a challenge, that's like most of them. So, yeah, yeah, we can get into a few like, but I am a fan of data mesh to answer your question. Yes. OK, so the related to that and absolutely getting into companies where they're not doing that.
00:38:22
Speaker
A friend of mine, Bobby Calderwood, we were talking about event storming, like where you get all the different stakeholders in the business into a room to talk about what a customer actually means to us all and that kind of thing. Do you think that is a useful, fundamental data engineering skill to pull people into a room and just thrash things out as a company? It could be. I think if your organization
00:38:53
Speaker
is flexible to that type of a thing, then sure. If you're an organization where you're just told to sit down, shut up and do your job, then I don't know if you're going to have much luck doing that. You know, hey, everybody, let's get together and talk about customers. Like, can you go back to your desk there and just not talk to us? That is a lot of companies, right? I think what we're talking about. Yeah, yeah. I guess what we're clawing at is, is data engineering
00:39:20
Speaker
proactive or remedial? Are we trying to push the way data is done into a new future? Or are we just trying to cope with the way things are today? I mean, if you're asking me what I would prefer, I prefer that people are, you know, proactive and, you know, pushing things forward. You know, but I don't know that that's
00:39:42
Speaker
the reality of most people's just intrinsic human motivations at their job. I mean, most people want to do the bare minimum and be as lazy as possible while putting on the appearance that they're trying to do as much work as possible. I mean, I've been around the block a bit. I'm not saying that I've seen, but I've
00:40:03
Speaker
I've been around the world a few times. I've seen a lot of companies operate. And cynically, that's how it is, right? I mean, the non-cynical version of me hopes that that actually happens. And again, the motivations for me writing the books that I've written and the book I'm working on as well as a lot of the courses I think
00:40:21
Speaker
It's to help urge people that there is a different way. We don't have to be stuck in this old school IT mindset where data is a cost center and we just have to be on the receiving end of everything that happens. We can push and make changes in the business. And in fact, for the business to be relevant, it behooves us as an industry to really help drive that conversation forward.
Data Industry Maturity and Growth
00:40:45
Speaker
You know, so it's about getting out of your cubicle if you happen to be in one of those things or, you know, and just really, you know, I would say understand how you can help impact the business more. But again, you know, that's what I want. I mean, I give very cynical responses because I
00:41:03
Speaker
I see enough to be a bit jaded. At the same time, I don't give up hope that we can make a difference as an industry, but I think because we have to. I talk a lot about this in my Substack, really. Again, I talk to people like Bill Inmon quite often, who's a really good friend of mine. He's been programming since back in 1960. I consider him one of the people that
00:41:29
Speaker
helped to invent the data industry to a large extent. And I ask him, why do we still talk about the same stuff that I keep reading about from back in the day? Adding value to the business and all these types of things. And he said, you've got to recognize this industry is very immature. Accounting has been around forever, thousands of years, arguably, if you look at something like
00:41:55
Speaker
hieroglyphics and so forth. I mean, arguably those are about accounting transactions. So, you know, but data as a profession, especially in the modern age of, you know, computing, which is still considered somewhat modern, it's like, it's, you know, it's still new and we're still trying to figure this out, right? It's like half a century old, which is nothing.
00:42:15
Speaker
nothing in the grand scheme of things. I'm reading a book right now, a really good book called How Data Happened. It's fascinating. It's good. You should read it too. For the audience, you should read it, but it's a good book about just the history of our modern usage of data, like statistics, right? That wasn't always a thing.
00:42:37
Speaker
And in fact, there was a period of time when people actually
00:42:46
Speaker
hated the idea that we were tabulating data and trying to figure out how does this data describe society, right? And so that was only back in the late 1700s, early 1800s. And so data collection is still fairly new in the grand scheme of things. And so for us to, and I still think the bridge between
00:43:13
Speaker
IT and the business is still very real. In fact, Bill and I were doing an event last week in Denver and he was talking about his biggest struggle. The biggest one that he's seen in his career is what he calls a divorce between IT and the business. It's a real thing. His goal in life or his wish is that hopefully that divorce is,
00:43:39
Speaker
you know, mended and everything's fine, but it's going to take some time. But he feels like we're getting there, right? I mean, the silver lining is he feels like we're getting there. And it's through, I think, discussions like this. I think through just being aware, right? I think the problem is with IT is, you know, to zoom in on that for a bit, it's like we tend to be very IT-centric and not focus on what the business needs. The business, on the other hand, tends to look at IT as a cost center.
00:44:06
Speaker
And by IT, I'm lumping in data teams and all this kind of stuff. But what changes this is obviously when data and technology become the center of the business. So obviously, newer school companies or companies that grew up in the internet era, quote unquote,
00:44:22
Speaker
you know, and the mobile era now today, you know, they have a much better shot at doing this because there's no difference between technology and the business. The technology is the business, and data powers it. And so that's the main difference. But then again, back to, you know, the fact that there's a lot of these mature companies out there where that hasn't been the case.
Future-Proofing Data Careers
00:44:39
Speaker
It's, you know, it'll have to happen. Right. I would hope so. I would hope we're gradually clawing our way towards the future on that.
00:44:51
Speaker
Yeah, so we'll see what happens. I mean, everyone does adopt AI in the next year and it'll be all over, you know, everything will just be great. Yeah. Yeah. So I'm told, so I'm told AI and blockchain and we won't have any problems anymore. Nah, it's all fixed. Okay. So obviously that's going to happen, but if it doesn't happen and someone wants to do the hard graft of moving us towards a data future.
00:45:20
Speaker
Give me a quick top three or four things someone should focus on learning to apply to their career.
00:45:34
Speaker
Obviously, get my book and read that. Hold all the keys to the kingdom here. Educate yourself. So obviously, you have to know the technical aspects of your job. To me, that's table stakes. Know how to code. Know how to use your systems.
00:45:56
Speaker
best practices. If you're in a cloud, know how that cloud works. Maybe, I dare say, get a certification so you really understand how they want you to think about that platform. I think those are good steps. Get the technical foundation there. The other things you need to do, though, are definitely not technical. This is kind of what we were talking about earlier, but the hard part is really taking that technical acumen
00:46:25
Speaker
and applying it to solving real problems. I know that sounds completely cliché, like I should be giving, like, a conference at a Regis or something talking about this. Add more value, and all these, like, stupid tropes.
00:46:41
Speaker
But people say this and people keep repeating this because it's absolutely true. If the entire industry is still crying the same thing, after a while you should probably notice this thing. And this is where I think it's that hard.
00:46:58
Speaker
But there are some ways to do this. I think, again, learn the bigger context of your craft and how your craft fits into a business. Again, the data engineering lifecycle: for a data engineer, that's at least one lens through which to view the world. I can't say it's the only lens, but it's a lens I think would be highly useful to you. Understand up and downstream too. If you're working with a software engineering team, how are they doing things? What do they want? What are their goals? What are they incentivized by?
00:47:25
Speaker
That's a big one. Like are they incentivized to finish their sprints or like what makes them tick? Because for you to work with an upstream team means you need to learn to play ball with these people. They aren't going to play ball with you. And the fact of the matter is, um, they don't need to care about you, right? Yeah. And the issue is you completely rely on them. Yeah, you need them. 100%.
00:47:48
Speaker
Oh yeah, so you better know how these people work and you better make some friends there for you. Don't go into this with a contentious situation saying, I demand my data look like this and blah,
Collaboration and Value Stream Mapping
00:48:00
Speaker
blah, blah, blah, blah. They'll be like, yeah, that's great. I'll get to you when I get to you. So definitely develop a good rapport with people who you depend upon for the data. And then obviously downstream,
00:48:14
Speaker
Work with data scientists, work with analysts, work with people who are dependent upon you, understand what they want, maybe spend a day shadowing them, just understand, okay, so like, you know, when a machine learning model is built, like, what does this look like? And how is it used? And how are you using my data that I'm providing you, and then
00:48:32
Speaker
what happens to it, and same with the report. And I think having that empathy goes a long way. I think the reason I gravitated towards data engineering was because I really understood what the output needed to be, right? It's not like it was a mystery. It's like I know what this model needs to look like and I know how to get there and you work backwards. And so it's about working backwards and forwards, right? Again, upstream, you need to understand those systems that generate the data and the people who are responsible for giving you the data
00:49:00
Speaker
I wouldn't say responsible, but the people you depend upon to do it. They aren't responsible for anything, as far as they're concerned, to you. So that's the reality of it. But then you got to zoom out. So those are immediate up and downstream stakeholders, shifting left and right, but then it's zooming out for the business. One thing I'd suggest, it's not even an exercise, but it's a practice: value stream mapping.
00:49:21
Speaker
Learning that, and, like, yeah, I think that's an awesome superpower. So this is a practice that came from Lean. I'm familiar with that, the manufacturing practice. So Lean is all about, like... Lean's awesome. Okay, so, like, you've heard of DevOps, right? I mean, that, yeah, got its start and inspiration from Lean. Lean is about, like,
00:49:45
Speaker
removing bottlenecks, eliminating waste, and providing more value to the end customer as efficiently and cheaply as possible. And so value stream mapping is a practice within Lean where it takes, okay, so if I have an external customer,
00:50:01
Speaker
So customer places an order, what happens from the time that order is placed until the customer receives what it is they happen to order. That's the value stream map. Value means the customer provides value, you get the value of their hopefully money or whatever they provide you.
00:50:19
Speaker
But it's about mapping that entire flow, right? Like what happens? Each step of the way, whether it's a process or information flow. This to me is like the superpower, because if you can understand, you know, end to end, you know, how a customer, how you provide value to a customer,
00:50:35
Speaker
And then you understand, OK, so along the way, there's data, right? When an order is placed, that's when that order is created. That's, OK, so what happens after that, right? And then shipping confirmation, so all this stuff, the data flows, right? And I think if you understand how that works, that's one of the most underrated superpowers, I think, you could have as a data practitioner, whether you're an engineer, a scientist, software engineer, whatever, but just understanding that flow and understanding, like, how do you impact bottlenecks, right?
00:51:01
Speaker
Where can you eliminate waste? So one thing that's wasteful is batch. Batch is inherently wasteful. Don't let things queue. Each process would pull what it needs. So maybe that's an argument for streaming, for example, where things just move in a continuous flow. Why do you have to wait on things? That's a cardinal sin in Lean. You don't wait. You eliminate waiting. Stupid.
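As a rough illustration of the value stream idea, the sketch below lists each step of an order flow with its hands-on time and the time spent waiting beforehand, then reports total lead time, the share of that time doing useful work, and the step with the longest queue. The step names, the minute figures, and the `summarize` helper are all made up for the example, not drawn from any real system.

```python
# Hypothetical order-to-delivery value stream. Each step records hands-on
# process time and the time the order sat waiting before the step began.
# Times are whole minutes so the arithmetic stays exact.
steps = [
    {"name": "order placed", "process_min": 5, "wait_min": 0},
    {"name": "payment check", "process_min": 10, "wait_min": 60},
    {"name": "nightly batch export", "process_min": 30, "wait_min": 720},
    {"name": "pick and pack", "process_min": 60, "wait_min": 240},
    {"name": "shipping", "process_min": 1440, "wait_min": 120},
]


def summarize(stream):
    """Return lead time, flow efficiency, and the step with the longest wait."""
    process = sum(step["process_min"] for step in stream)
    wait = sum(step["wait_min"] for step in stream)
    lead = process + wait
    longest = max(stream, key=lambda step: step["wait_min"])
    return {
        "lead_time_min": lead,
        # Fraction of the total elapsed time spent actually working on the order.
        "flow_efficiency": process / lead,
        "longest_wait": longest["name"],
    }


report = summarize(steps)
```

In this invented data the nightly batch step dominates the waiting, which is the kind of queueing the argument for continuous flow is aimed at. Real numbers would have to come from measuring the actual process.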
00:51:26
Speaker
So understanding waste, right? And what is waste? What is necessary waste? What is unnecessary waste that needs to be banished, right? These are the things that I think Lean really advocates for. But I would say the superpower for, you know, practitioners in technology is, I think, adopting these principles and understanding
Conclusion and Future Discussions
00:51:42
Speaker
them. I hopefully this is
00:51:44
Speaker
Because if you can, if you have this perspective, then you actually understand. You start having empathy, empathy with the business, right? And so I think that's one of the things I'm trying to advocate for. Um, so I don't know if I'm like making success, you know, progress in that or if I'm just screaming at the sky, but, uh, it's a lot of fun either way. So yeah, I can imagine. It's funny. There's a very neat parallel there. You start off by thinking your job is to get from,
00:52:08
Speaker
you know, one serialization, a JSON API over there, to protobufs over there. What you're really doing, it's not just integrating data, but integrating people and ideas. Bingo. Yeah. Yeah. That's all it is. That's all it is. I mean, you can make APIs for their own sake. That could be fun too. That's not going to move the business forward, right?
00:52:30
Speaker
That's exactly it. And I think at the end of the day, you know, the other superpower is just understanding what the business wants. So what you just said, like, what are you doing? And it's like, again, I feel like I'm that guy in the blue shirt and khakis at the Regis talking about adding value to the business. But at the same time, it's like this, that is the job, right? Yeah. And the particular technologies will enable it, but you have to understand context and what your
00:52:55
Speaker
Exactly. What you're really integrating, which isn't just bits and bytes. Right. Yeah. Yeah. So, okay. We're ending on a hefty, weighty business wide note there, which makes me happy to give you your fundamentals badge. Thank you. Joe, thanks very much for talking to us. Cheers. Anytime, Chris. Great talking to you, man.
00:53:20
Speaker
Thank you very much, Joe. I hope the book on data modelling goes well. It sounds like something we do need. Maybe we'll get you back in the studio once it's published. Until then, we will be speaking to lots of other guests. So if you've got a suggestion for who we should be talking to, please drop me a line. I know there are some people out there who are happy to volunteer themselves. This is good.
00:53:43
Speaker
but feel free to volunteer someone interesting too. I'll see if I can persuade them to talk to us. As ever, you'll find my contact details in the show notes. But you don't need to do any suggesting. You don't need to do anything except maybe click like and subscribe, and all the answers will be delivered to you in due course.
00:54:02
Speaker
This is a data stream I'm very happy that people are subscribed to. And we will pub your sub as soon as we can. So on that note, I leave it till next week. I've been your host, Chris Jenkins. This has been Developer Voices with Joe Rees. Thanks for listening.