
Jean-Georges Perrin - Make Open Data Contract Standard (ODCS) Work

S1 E6 · Straight Data Talk
26 plays · 3 months ago

Jean-Georges Perrin is a serial startup founder, currently co-founder of AbeaData [https://abeadata.com/], and co-author of "Implementing Data Mesh." He championed PayPal's data contract project, which is now part of Bitol and the Linux Foundation. In this episode, JGP talks about building and maintaining open-source data contract solutions using open standards. He explains why and how he came to it, and the challenges of maintaining it so that the solution is not appropriated. JGP discusses how they balance the interests of different groups while developing a community around open data contract standards. More importantly, he shares how data contracts can positively change the life of every data engineer.

Check out JGP's LinkedIn
Check out Bitol - Open Standards for Data Contracts and become a contributor.

Transcript

Introduction to 'Straight Data Talk'

00:00:01
Speaker
Hi, I'm Yuliia Tkachova, CEO and co-founder at Masthead Data. Hi, I'm Scott Hirleman. I'm a data industry analyst and consultant and the host of Data Mesh Radio. We're launching a podcast called Straight Data Talk, and it's all about hype in the data field and how this hype actually meets the reality.

Podcast Vision: Successes and Challenges

00:00:17
Speaker
We invite interesting guests who are, first of all, data practitioners to tell us their stories: how they are putting data into action and extracting value from it. But we also want to learn about their wins and struggles, as a matter of fact. And as Yuliia said, we're talking with these really interesting folks that a lot of people don't necessarily have access to, and these awesome conversations typically happen behind closed doors. So we want to take those wins and losses, those struggles, as well as the big value that they're getting, and bring those to light so that others can learn from them. And we're going to work to distill those down into insights so that you can take these amazing learnings from these really interesting and fun people and apply them to your own organizations to drive significant value from data. Yeah, so every conversation is unscripted, in a very friendly and casual way. So yes, this is us. Meet our next guest. And yeah, I'm very excited.
00:01:19
Speaker
Okay, well, hello, it's us again. Today we have the pleasure of hosting the one and only Jean-Georges, also known as JGP and also known as the Beyoncé of data. Yeah, and I think it's obviously clear which of those names he enjoys best. He's a very dear friend of mine who is now running a company, AbeaData.

Guest Jean-Georges' Introduction

00:01:52
Speaker
He's also the founder of Bitol and also happens to be the author of a few books. Yeah, and now over to you, JGP. Thank you so much for joining us. Happy to have you.
00:02:07
Speaker
Well, I'm feeling really great. I'm not great, I'm feeling great. And thank you so much. I'm really grateful for your invitation. There is a bit of complicity between Yuliia, Scott, and myself, so it might transpire a little bit into this episode. There might be a few private jokes and things. We'll have to edit in a few footnotes, I guess. But yeah, thanks for having me. So my name is Jean-Georges, or JGP, or the Beyoncé of data. I'm the chief innovation officer for a company called AbeaData. It's a startup that I created with good friends of mine, Kim Thies and Todd Nemanich, after our experience at PayPal, especially in the field of building data mesh.
00:02:59
Speaker
So it gives you a bit of a sense of where we're going. As you said, I don't wear only one hat. I'm also the chair of the Bitol project at the Linux Foundation. The Bitol project is an attempt at building standards for modern data engineering. And it's a successful attempt so far. We have one standard under our belt, which is the Open Data Contract Standard, or ODCS, and we are working on a few others that we can elaborate on.
00:03:38
Speaker
And one other hat is that I co-founded an organization called AIDA User Group. AIDA stands for artificial intelligence, data, and analytics. And that was what we did in Barcelona all together back in January of this year. And as you said, yeah, I've written a few books. I self-published a few. The books that people don't want to publish, I self-publish, so it's a little bit more creative. You've got Data Contract for All Ages, and that one is Data Mesh for All Ages in Telugu.
00:04:14
Speaker
But on the publishing side, I wrote Spark in Action, Second Edition at Manning. And I just finished, with my friend Eric Broda, the last draft of Implementing Data Mesh at O'Reilly. It should be in your best bookstores in September. Looking forward to it. Yeah, well, I will give you a copy. But signed. Signed. Signed, by Eric too. Yeah, so it would have to travel to Toronto, then back to New York, and then it would have to reach me. And we're not going to disclose where I am. I want to make that happen. And in about 15 years, you'll get your book.
00:04:58
Speaker
Sounds really inspiring. Thank you for that. Well, you know, you live far away. It was more convenient at some point, but we all travel a lot. And finally, you know, you also get credentials as you go through life. So I'm a Lifetime IBM Champion, I'm a PayPal Champion, and I'm a Data Mesh MVP. And I think Scott is also a Data Mesh MVP. I think, I don't know; I started Data Mesh Learning, so add it up. I don't care about that. Can you please tell me what a Data Mesh MVP is? Because an MVP for me is a minimum viable product. What does it mean in your case? It's probably the same thing. No, but I think, originally, and that's what I told them at the Data Mesh Learning community, MVP is a very American-centric thing, because the real acronym, or the default acronym, is most valuable player.
00:05:57
Speaker
It comes from sports. From sports. And I think it's really something for baseball, basketball, and all these things. You've got the MVP of the year and all those things. So now you go back to Europe, and in Europe everybody thinks it's a minimum viable product. So you pick which one I am. Okay. Okay. I didn't know that you're this important. Up to two books. I'll send you my invoice afterwards. Don't worry.
00:06:28
Speaker
Scott, are you going to cover that with me? The fun of hosting a podcast is that every once in a while you invite somebody on and they say that. They go, yeah, yeah, I'll appear, but here's my appearance fee. And you're just like, no, never mind. I didn't know; this is the insight of today: people want to get money for that. Oh, wow. Well, okay. We need to figure out something, but thank you so much for joining. I'm excited. You know, when you lined up everything you're doing right now, you listed every project where you participate, and I'm kind of angry.
00:07:11
Speaker
How do you find time to do all of that? And where is most of your attention going? Well, you know, all of those projects are interconnected, so it's a little bit easier to find time. But the thing is, I'm definitely grateful to my company for leaving me some time for some of my side activities. But it's not like I would be volunteering at a food bank, for example. When I work on Bitol, it actually serves AbeaData as well. When I work with AIDA User Group, it's also helping Bitol.
00:07:57
Speaker
And so, you know, it's kind of all interconnected. I wish I had more time to do a little bit more for charities. But the way I help charities is through the profit of my little books, the books I self-publish: most of the profits are going to two charities. One is Girls Who Code and the other one is Black Girls Code. So yeah, that's how I... Wow. Oh, good to know. Nice.
00:08:30
Speaker
Listen, I want to go deeper into Bitol and AbeaData. Can you please tell us what you guys are building at AbeaData? Am I pronouncing it correctly? I feel like I'm not. It's like the bee, you know. It's AbeaData. Think of the bee. Abea. Oh, it's bee. Yeah. Okay. AbeaData. So what are you doing at AbeaData and Bitol, how much are they interconnected, and what is the difference? So, well, let's start with Bitol. I think Bitol in our context today is even more important. When I was working at PayPal, we successfully built a data mesh and we went to production with it.

Data Contracts and Open Sourcing

00:09:12
Speaker
And as we went through this journey, we realized that one of the key elements, artifacts, that would be really helpful for us was data contracts. At the very beginning, they were a little bit like a resource descriptor for data products. So that's how it started. And we realized that there was a lot more need for data contracts. And as we were working on them, we realized that, well, if we want to expand that, it might be a good idea to open source it.
00:09:55
Speaker
And so I went through the legal process at PayPal to make sure that it was open sourced, and it instantly became a very popular open source project in the PayPal catalog of projects. And I think it's still in the top 10, something like that, of the active projects. Then things happened and I left, but I didn't feel like anyone at PayPal was going to maintain the open source version of the data contract.
00:10:34
Speaker
So I decided to bring it somewhere else. I could have done it on my own and said, oh, this is my version of a data contract, but I needed stability, I needed credibility. So I brought it to the Linux Foundation. I went to see the Linux Foundation and I explained the idea of the project, saying very clearly that the data contract was the first step, and then we wanted to develop more standards around modern data engineering. And we already had this great asset. So the goal for us was to
00:11:18
Speaker
then create, well, between the Linux Foundation and myself, the goal was to create a group of sufficient size that would be interested in that. So this is where we are now. The open source version of the data contract, from a PayPal perspective, that was May 1st, 2023. In September, I approached the Linux Foundation. And in October, I did a tour of Europe, as I'm kind of starting to get used to, to recruit all my partners in crime there.
00:11:58
Speaker
And by November, we officially announced the creation of Bitol, with ODCS as part of Bitol. And we have Europeans indeed, but we also have American people and we have people in Asia as well. So I had our monthly call this morning at 7 a.m. Hey, when you want to do global business, that's what happens. And yeah, that's how we started it. One important thing about the TSC, the technical steering committee for this project, is
00:12:40
Speaker
I wanted to make sure that we had three groups of people in it: software vendors, service providers, and users. And it's roughly balanced. We are about 15, and there are about five of each group in the TSC, because I really wanted to have the different perspectives on that. And we also have geographically different perspectives. We have this category of users. And culturally, we've got young people, and we've got slightly older people as well, with more experience, et cetera. So it's a really... Yeah, I lost him as well. Oh, you're here.
00:13:27
Speaker
So we lost you, JGP. We lost you for a good bit. You were saying that you had software vendors and you had service providers and practitioners, and then you went into a little bit of why, but we lost you right at the beginning of why.

Interoperability Challenges

00:13:39
Speaker
Okay. So I'll start the why again. But it should be recorded locally as well. No, but we just didn't hear it, right? Yeah, we didn't hear it, so we can't respond. Yeah. Yes. You just have to agree. Okay. So why I wanted to have these three groups of people is to have this diverse vision of what a data contract should be. One important thing was I didn't want it to be driven only by a vendor.
00:14:14
Speaker
And you've got a lot of standards like that out there. A software vendor decided to push a standard, and basically they open sourced the standard or they created an open standard from their proprietary standard. But there's no adoption. I mean, there's no real willingness to adopt the standard if it's just one company or a couple of companies, or even if it's three software vendors getting together. So this is why I wanted to have this approach, and this is also why
00:14:51
Speaker
we are pushing the standard first. When we're describing the standard, we are not thinking about its implementation, or about how painful it could be to develop the tooling around it. I don't really care about that point. That's an issue for the software vendors. But we wanted to make sure that it's comprehensive, that it's understandable by users, and that it makes sense. Yeah, and I want to pretend that you took some learnings from what I did with Data Mesh Learning and the issues that I had with the community. Because in the early days of the community, I was the driving factor, right? And we've gotten more people in there, but there were so many people that were like, Scott, can I have permission? Scott, can I have permission? So I became a single bottleneck, versus having multiple people aligned before you start heading down a path. Because I was already so far down a path
00:15:42
Speaker
of, you know, here's the content, here's what I've learned, here's all this stuff, that it becomes difficult. And I find this in the data space, right? This is exactly what you said about one company or a couple of companies open sourcing a standard. Even if they're an implementer, a lot of times there is something that is so tied to their view of the world, their way of working, that it's, you know, my way or the highway instead of how do we create the information superhighway. How do we create these ways of making the work easier for people? I don't see that often in data, and it frustrates the hell out of me, because so much of the stuff is here are all of the capabilities instead of
00:16:26
Speaker
here's what we need to do and why. And so a lot of what you're talking about there is what needs to happen for this to be useful for people, instead of what are all the cool things we can do or what are all the cool features we can sell to people. It's so much more about: here's what people need to do, and people are going to come in with stuff that's moderately compliant, that's not going to have every single thing. You know, the software vendors aren't going to be able to come in and fully meet the entire standard. But you have an assumption of interoperability. You have this assumption,
00:16:58
Speaker
which again is another problem in data: there's no interoperability. And so you have these custom bridges between software A and software B. And then as soon as software A changes or software B changes, it breaks. And maybe one of those vendors maintains the bridge between them, but they only have a reason to do that as long as it's commercially viable. And so, exactly what you're talking about; you know, we were talking in the pre-call: I am so frustrated by the data space having a complete lack of standards across almost everything. And Yuliia was throwing out some that I don't even think are standards, because again, they're open sourced by one or two people or a vendor. You know, the OpenLineages, the OpenMetadatas, these types of things that are great ideas, but then they're controlled by such a small party instead of a diverse party that they don't end up incorporating
00:17:48
Speaker
broader ideas that are good ideas; they end up as these very pigeonholed things. They take off because they're pretty complete for a while, and then they just languish at some point when somebody makes a decision that other people don't agree with. So I'll shut up for a second, but I think it's really important to talk about why we don't have standards in the data space and why this holds us back so much more than in the software space. In the software space, you have to worry a little bit about interoperability, because sometimes the tools break and things like that, but somebody fixes it pretty quickly. In the data space, there's no incentive. The vendors are highly incentivized not to go into these open standards. I have a hypothesis.
00:18:34
Speaker
First of all, I don't agree that in software there are no problems with the connectivity of different pieces. There is plenty of that. I just think that the software engineering community is bigger than the data engineering or data community, first of all because of the industry's maturity. For a long time, everyone wanted to be a software engineer instead of a data engineer, but as of today it's like the craze of the data space.
00:19:04
Speaker
The open source things in data are also lagging because of the smaller community and, you know, being earlier down the road than software engineering. But what is really interesting is: are we able to have as many open source solutions in the data field as in software engineering? Because I think data is viewed as a very, how do you say, proprietary thing; at its core, it is a reason to compete for a lot of organizations. And yeah, we can use open source things, but is it going to have as much of a rise in open source for data as in software engineering?
00:19:50
Speaker
Yeah, JGP, I'd love to hear how you think about that as well. Because as data people, we're not forcing... Like, if you're in software and you're not open source or you're not open standards, you're not complying with stuff, people just say, we're not going to use you. But that doesn't happen in data.

Open Source as Strategy

00:20:06
Speaker
Is it that we need to band together as practitioners and force vendors to do this? How do you think about it? Did you want to do this because you thought it was better for the industry, or because you thought it was better for commercialization? Why are you doing this open source? On the software side, a lot of times it's that you can't get in front of people if you're not open source, but in data, it's not the same case. Yeah, there's a lot going on. I'm sorry about that. No, no, no. It's a great conversation and honestly, I love it. So the first thing is,
00:20:41
Speaker
I think you're right. Not everything is rosy in the software engineering world. I often say that I'm a software engineer by training, but I self-identify as a data engineer. I think that when you're trying to sell a software solution, if you don't have at least a part of it which is open source, it's going to be very difficult. And that's whether it's in the software or in the data world. And I'm just thinking about it: look, in front of my screen, I have Airflow, Spark, dbt, and Kafka. So there's all of that.
00:21:22
Speaker
All of that is pretty open source. And of course, there are commercial versions of all of that. For dbt, there's a cloud version; for Spark, there's all the Databricks stuff, and Azure, and all the embedded-everywhere part. So we need to make sure that we are not confused about open source: for me, open source is just another business model. Nobody's doing that because they're just kind and they want to make things free for everybody. It's just a business model. A while back, as an anecdote, and going back to software, I met the people at IBM that
00:22:06
Speaker
open sourced Eclipse, created the Eclipse Foundation, supported it early on, and made sure that it became the fantastic tool that it was, at least at the beginning of its life. Now, I have a few reservations, but that's another story. But one of the big motivations behind that was to really annoy Microsoft with Visual Studio, and to have an open source tool. So that was clearly a use of open source in a business strategy.
00:22:41
Speaker
And I think we're still there. But here's what I see as well in data, and it depends, as always, on whether you're talking to smaller companies or to enterprises.

Empowering Smaller Vendors

00:22:54
Speaker
But when you're talking to enterprises, a lot of them want to develop their own stuff. And then you end up with hundreds of thousands of data pipelines. Or they want to trust the big players, the IBMs, the Oracles of the world, or these days it's more like Databricks and Snowflake.
00:23:19
Speaker
And they have this impression of trusting a company. You know, as the old quote from the 80s goes, nobody got fired because they bought IBM. And now it's probably nobody got fired because they bought AWS. So I think we're still there, but one of the motivations behind creating open standards was to enable the small guys, where there's a lot of innovation, to be able to interoperate together.
00:23:53
Speaker
So I'm thinking of my friends at Data Contract Manager, or Data Mesh Manager, in Germany. If we want to be able to interoperate with them, we've got to have the same format, the same understanding of the structure, so that I can, for example, create data products and reference them in their tool. Or if there's a data catalog, I want to be able to share the richness of my data products with the data catalog. And if everybody has a different format, then that's what happens: you've got a lack of interoperability, you've got proprietary solutions.
00:24:34
Speaker
I've heard so many times vendors, I mean salespeople from major vendors, saying things like, okay, if you use my big platform, it's all going to be good, and all your data problems will be solved. And I think they're not even lying, but it means that you've got to completely migrate all your stuff to their platform to really get the benefit of that. So is that something you want to do? Well, maybe, but if you are a major company, you're never going to do it, because you've got 30,000 data pipelines that you've got to migrate to their new tool. And then the problem is going to appear again. So yeah, I think really the big motivation behind Bitol is to enable the small players, or the smaller players.
00:25:24
Speaker
And I have a question on that. You mentioned that you open sourced the PayPal data contract as Bitol, which is part of the Linux Foundation. So now you're saying that the smaller players can use it and do things nicely from scratch, right? But there is a challenge. Data contracts needed to be implemented at PayPal because of the size of the organization, the diversity of the organization, the data silos; all of that is implied by a big enterprise with a lot of regulations, especially in the financial industry. That's not the same as saying this is a thing that we just open sourced for smaller players to implement.
00:26:14
Speaker
Does any smaller player have the same challenges as PayPal? Maybe a company at a different scale doesn't need it at all. When I say smaller players, I'm thinking more from the vendor perspective, from the software vendor perspective, the tool makers. You may need several different tools. Let's say you want a tool for data quality, a tool for tracking lineage, a data marketplace, and observability. Let's go with these four tools.
00:26:46
Speaker
If you have data engineers, they need to configure the four different tools differently, because they have different UIs and may not have APIs. The choice of tool is made by someone; it could be an enterprise architect, but it could also come from departmental views. So you see all these people with different tools, and just the integration of all that is going to be a mess, because where is your source of truth for your metadata going to be? Everybody talks about the source of truth for your data, but there's also a notion, a much more important notion now, I think, which is the source of truth for your metadata. What defines my business rules? What defines my data quality rules?
00:27:31
Speaker
And I think, and that's the belief and the promise of Bitol, that if you use something like ODCS, well, if software vendors adopt ODCS, then you can have one source of truth for your metadata. You define your data quality in one place. You define your schema in one place, and it becomes your source of truth for your schema. You define the list of stakeholders for your governance in one place. You define your roles for security in one place, et cetera. And then you make sure that this source of truth is being used by the multiple tools you have in your enterprise. And this allows smaller companies to be able
00:28:13
Speaker
smaller companies to actually build things around it so that they can more easily go to market with larger enterprises. I'm thinking about companies like ours, AbeaData, or yours, Masthead. If you're going to say, okay, I'm following open standards, it means that interoperability is greater, that you've got to do the work once for everybody, et cetera. So that's the idea behind it. Sorry, I have a quick question. I think it's quick, but you say one source of truth. I have this real big problem with this, of forcing people to use one tool
00:28:55
Speaker
instead of letting them define what they need to define in their workflow and having it flow to everything else appropriately. How do you think about that? Because again, you talked about other tools requiring everybody to completely rebuild the way they do everything. I was having this conversation with a friend, and we were talking a little bit more about marketing and somebody being in market or not in market, you know, when they're ready to buy versus when you intersect with them. But, you know, if the two of you are both married, you might meet somebody and you're like, wow, I have a great spark with this person, but I'm not in market for a major change. You know, I'm happy with my relationship and it's not worth it. Well, it's this exact thing of:
00:29:38
Speaker
Do you think that with this standard, you have to change absolutely everything to start doing it? Or do you have one single place where they do it? Or do you have: hey, developer, wherever you're updating your code, wherever you're updating your work, you're doing the work in that place and it flows appropriately, versus you now have to go to this other tool. This is the problem that everybody has with catalogs: none of the people that are doing the work want to go into the dang catalog itself and update everything, because they do all of the coding work somewhere else. And then they go, okay, now I have to go into the catalog, and it doesn't work with their workflows at all. So how do you think about that? Again, it might not be the smallest question, but I do think it is a small question of just, like,
00:30:25
Speaker
How do you think about not disrupting people's workflows, so they can do the damn work that they want to do instead of jumping into yet another tool? So that's an excellent point, Scott. And as I said, I don't want to oversell AbeaData here, but this is really something that drives the way we're thinking. First, we don't want people to go to other tools. If you're comfortable in your notebook as a data scientist,
00:30:59
Speaker
I don't want you to go click 20 times to find something in Collibra, for example. I'm not picking on Collibra there. I think it's just good software practice to bring the information to where the user is. And the days of the central portal, which is never updated, are, for me, gone. So that's one thing. Regarding where the work is happening, I think it's exactly the same thing. Your data scientist is in his notebook, is looking at the data, sees something that is weird. Well, the feedback loop just takes it from there. Or you've got the data engineer who is in dbt,
00:31:48
Speaker
building his pipeline. Well, as he builds the pipeline, the data contract is checked, verified, built, modified, whatever. And at the end of the game, it can eventually be enriched by the data product owner, or any product owner, and then pushed to the catalog. And then it flows. As you say, it's dynamic. These pieces of information just flow into your architecture and are surfaced where they need to be surfaced.
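Editor's illustration of the flow JGP describes, where one contract is the single definition of schema and quality rules and the pipeline checks it where the work happens. This is only a minimal sketch under stated assumptions: the contract layout (`schema`, `quality`), the `orders` example, and the helpers `validate_rows` and `pipeline_step` are hypothetical names invented for this example. They are not the ODCS field names and not a Bitol API.

```python
from typing import Any

# Hypothetical, simplified contract. These keys are illustrative only;
# they are NOT the actual ODCS field names.
CONTRACT: dict[str, Any] = {
    "name": "orders",
    "schema": {"order_id": int, "amount": float, "currency": str},
    "quality": [
        {"name": "amount_non_negative", "column": "amount", "min": 0.0},
    ],
}


def validate_rows(rows: list[dict[str, Any]], contract: dict[str, Any]) -> list[str]:
    """Return readable violations; an empty list means the rows honor the contract."""
    violations: list[str] = []
    schema = contract["schema"]
    for i, row in enumerate(rows):
        # Schema check: every declared column must be present...
        for col in sorted(schema.keys() - row.keys()):
            violations.append(f"row {i}: missing column '{col}'")
        # ...and present values must have the declared type.
        for col, expected in schema.items():
            if col in row and not isinstance(row[col], expected):
                violations.append(
                    f"row {i}: column '{col}' expected {expected.__name__}, "
                    f"got {type(row[col]).__name__}"
                )
        # Quality check: numeric lower bounds declared in the contract.
        for rule in contract["quality"]:
            value = row.get(rule["column"])
            if isinstance(value, (int, float)) and value < rule["min"]:
                violations.append(
                    f"row {i}: rule '{rule['name']}' failed ({value} < {rule['min']})"
                )
    return violations


def pipeline_step(rows: list[dict[str, Any]], contract: dict[str, Any]) -> list[dict[str, Any]]:
    """Pass rows downstream only if the contract holds; otherwise stop the pipeline."""
    violations = validate_rows(rows, contract)
    if violations:
        raise ValueError("; ".join(violations))
    return rows
```

The design point is that the pipeline, a quality tool, or a catalog could all read the same contract object instead of each keeping its own copy of the rules. A real setup would load the contract from a shared file rather than hard-coding it.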
00:32:19
Speaker
So my question on that, and I just wrote down the point that you made, to bring the information to where the user is. When we build the product, and as a product manager, I think that we need to come to where the user is already. And so I want to understand you better. So you are saying, if a potential customer is going to implement ODCS as an open source data contract standard, it's going to be easier for us as a vendor to come in and integrate with that, right?
00:32:57
Speaker
It will take forever until they have this adoption, and implementing an open source data contract itself is not a task that could be done overnight. I don't know how much time it will take for a large organization, maybe even a medium organization, to adopt it first. And then I'm like, do you see the trend for you to have this, you know, viral adoption where all the vendors would want to have
00:33:28
Speaker
integration and be, you know what I'm saying? Yeah. So the thing is, there's always greenfield and brownfield. Okay? And I think that's where tooling comes into

Tooling for Data Contracts

00:33:42
Speaker
play. Okay? The thing is, am I going to write YAML by hand for, let's say... Let's take the example of a data pipeline. Okay? So a data pipeline is taking data from somewhere and pushing it somewhere else. I guess we can probably all agree on that. So if you want to have,
00:34:07
Speaker
In an ideal scenario, you would have two data contracts. You can have a data contract at the output of your data pipeline, defining basically what is the promise of your data pipeline, right? And saying, okay, this is going to be the schema, this is going to be the documentation, these are going to be the SLAs, etc. But you can also imagine having a data contract at the input of your pipeline, okay? Saying that, okay, this is the schema of the data I'm expecting. This is when I'm expecting it to be updated. Okay? So let's say it's 9 AM. I want to run my pipeline. And the data should have been updated at 8. But it's actually not updated yet at 9. So my input contract is not valid. And if you've got the proper tooling for that,
00:34:53
Speaker
well, the thing is, the development effort for your pipeline is going to be lower, right? Because you don't have to validate all these things. You just rely on the trust that the contract is giving you. So let's say I've got a company with 10,000 data pipelines, which these days is not completely surprising. That means that, oh, I've got to write 20,000 data contracts to take my brownfield to the next step. So that's one approach. And if you're going to go see a CIO or a CEO and say, well, you'll get the benefit of that when you've written the 20,000 data contracts, then, of course, your ass is going to hurt because you're going to be kicked very hard.
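Editor's sketch: the 9 AM scenario above can be illustrated with a minimal input-contract check. This is only a sketch under stated assumptions: `InputContract`, `check_input`, and the field names are hypothetical and are not the ODCS schema or any Bitol tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class InputContract:
    """Hypothetical, simplified input contract (not the ODCS schema)."""
    expected_columns: dict[str, str]  # column name -> type name
    max_staleness: timedelta          # how old the data may be at run time


def check_input(contract, actual_columns, last_updated, now):
    """Return a list of violations; an empty list means the input is valid."""
    violations = []
    for name, type_name in contract.expected_columns.items():
        if name not in actual_columns:
            violations.append(f"missing column: {name}")
        elif actual_columns[name] != type_name:
            violations.append(f"wrong type for {name}: {actual_columns[name]}")
    # Freshness: the data must have been updated recently enough.
    if now - last_updated > contract.max_staleness:
        violations.append("data is stale")
    return violations


contract = InputContract({"id": "int", "amount": "float"}, timedelta(hours=1))
columns = {"id": "int", "amount": "float"}
run_time = datetime(2024, 1, 2, 9, 0)  # the pipeline runs at 9 AM

# Data refreshed at 8 AM today: the input contract holds.
fresh = check_input(contract, columns, datetime(2024, 1, 2, 8, 0), run_time)
# Data last refreshed yesterday at 8 AM: the input contract is not valid.
stale = check_input(contract, columns, datetime(2024, 1, 1, 8, 0), run_time)
```

With a check like this at the input, the pipeline itself does not need to re-validate schema and freshness; it can refuse to run when the list of violations is not empty.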
00:35:38
Speaker
Okay. So what I'm saying is, first, the tooling needs to be helping you in that. Okay? A lot of the things you find in the data contract can actually be created with tooling. So we could extract a lot of information from your existing pipelines, from the outcome of your pipeline, and create the contract as well. Okay? So that you've got a baseline. And you don't have to do it 20,000 times. You can do it for the critical ones, or as you see failures happen. Okay? I often take the example of my son, who recently started a job as a data engineer. Okay?
00:36:23
Speaker
And he's in charge of a very important pipeline, and sometimes it fails. Okay? But if they had a data contract for this very important data pipeline, well, it would already save them quite a bit of trouble. Okay? So it's like everything in a kind of migration view of the thing: you don't have to migrate everything to already see the value. And that's what I'm telling our customers and prospects and the people and the partners I'm seeing when I'm touring Europe or the US. You don't have to do it all now to see the value and to get value out of that.
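Editor's sketch: the idea of letting tooling extract a baseline contract from the outcome of an existing pipeline, for the critical pipelines only, could look roughly like this. The function name `infer_baseline_contract` and the output layout are hypothetical; real tooling would emit a document following the ODCS schema, which is not reproduced here.

```python
def infer_baseline_contract(name, rows):
    """Infer column types and nullability from sample output rows.

    The result is a starting point for a human to review and enrich,
    not a finished contract.
    """
    columns = {}
    for row in rows:
        for key, value in row.items():
            col = columns.setdefault(key, {"types": set(), "nullable": False})
            if value is None:
                col["nullable"] = True
            else:
                col["types"].add(type(value).__name__)
    # A column that is absent from some rows is treated as nullable.
    for key, col in columns.items():
        if any(key not in row for row in rows):
            col["nullable"] = True
    return {
        "name": name,
        "columns": {
            key: {
                "type": "/".join(sorted(col["types"])) or "unknown",
                "nullable": col["nullable"],
            }
            for key, col in columns.items()
        },
    }


sample = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
]
baseline = infer_baseline_contract("customers_output", sample)
```

The point of the sketch is the workflow, not the inference logic: generate a baseline for the one, two, or three pipelines that matter, then let a product owner enrich it.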
00:37:05
Speaker
You can isolate maybe one, two, three pipelines. And really, you know, I'm going to offend a lot of people by saying that. But the thing is, let's go back to my example of my data contracts. I've got a contract at the beginning. I've got a contract at the end. What can you call this new thing? Okay? You've got a pipeline and two contracts. Maybe a data product? Yes. Exactly. Okay. So the thing is, then you can more easily assemble. Okay? And I posted something on LinkedIn where, for the first time ever on LinkedIn, I disabled comments. I said that, yeah,
00:37:46
Speaker
the data product is an organized set of data contracts. Okay? And a lot of people were just pissed and offended. But whatever. We've got an expression in France saying, you don't make an omelet without breaking eggs. So if I offend people a little bit, too bad. But the thing is, it's that. Okay? And then, if your pipelines become data products, okay, yes, it's not, stricto sensu, part of the data mesh because it may not be aligned to a domain. But you already have product thinking. And you already have your data product assigned to a use case, because your pipeline was assigned to a use case. And, Scott, I remember one of your episodes of Data Mesh Learning, where I think it was when you were talking with Vista, where they said they don't align their data products to domains anymore, but more to use cases. And that's the perfect scenario there.
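Editor's sketch: one way to picture "a pipeline and two contracts" as a data product, and why that makes products easier to assemble. All names here (`DataContract`, `DataProduct`, `can_feed`) are hypothetical illustrations, not part of ODCS or any Bitol standard.

```python
from dataclasses import dataclass, field


@dataclass
class DataContract:
    """Hypothetical, minimal contract: a name and a column -> type mapping."""
    name: str
    columns: dict[str, str]


@dataclass
class DataProduct:
    """A pipeline wrapped by its input and output contracts."""
    name: str
    use_case: str
    input_contracts: list[DataContract] = field(default_factory=list)
    output_contracts: list[DataContract] = field(default_factory=list)


def can_feed(output, expected_input):
    """True if the output contract provides every column the input expects."""
    return all(
        output.columns.get(column) == type_name
        for column, type_name in expected_input.columns.items()
    )


orders = DataProduct(
    name="orders",
    use_case="order reporting",
    output_contracts=[
        DataContract("orders_out", {"id": "int", "total": "float", "day": "date"})
    ],
)
revenue = DataProduct(
    name="daily_revenue",
    use_case="finance dashboard",
    input_contracts=[DataContract("orders_in", {"total": "float", "day": "date"})],
)

# Assembly check: can the first product's output feed the second one's input?
compatible = can_feed(orders.output_contracts[0], revenue.input_contracts[0])
```

Because each product states its promises and expectations explicitly, compatibility between two products becomes a comparison of contracts rather than an inspection of pipeline code.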
00:38:41
Speaker
Well, and Yulia, your question was a little bit of, because, you know, JGP went in a different direction, but your question was a little bit of, you know, how much migration work do I have to do? But also, as a vendor, why are vendors incented to align to this? And again, it's what I go back to with the software versus the data world. Our users, our practitioners aren't demanding it, right? Part of the reason that the data world is so broken is because the data world gets most of its information flow from vendors. And those vendors are pushing their own worldview. And so when all of the information power out there to say what is good and what is bad is so heavily tilted in favor of the vendor,
00:39:26
Speaker
they're going to do the thing that's more commercially viable until we have users rise up and revolt, right, and say, you must do this or we won't bring you in. And we're not seeing that in the data space. We see that in the software space. You know, I worked at DataStax, open source, Apache Cassandra, now they do Pulsar and all sorts of other crap. But there were multiple times where we would be trying to sell DataStax Enterprise, which was not open source. It was built on top of the open source, but it wasn't open source. They'd be like, we're not going to bring in anything that's not open source. There are many companies where they say, for this critical infrastructure, we will not bring in anything that is not open source. That's changed somewhat, especially with what MongoDB and Elasticsearch and those have done, where they're no longer open source. They're just kind of source available or whatever.
00:40:19
Speaker
You know, they have server-side licensing. Let's not get into open source licensing issues. But a massive reason that the data space is so broken and so fragmented, and that nothing interoperates with each other, and that we have no standards, is because we don't have people demanding that if you want to sell into my company, you have to comply with something. And so vendors aren't incented to, and, you know, you're going to have to go to every single vendor and have a one-to-one solution. Like, iPaaS shouldn't exist, you know, integration platform as a service. It's literally connectors between this service and this service, this service and this service, and they're billion-dollar revenue businesses,
00:41:03
Speaker
multiple of them that have a billion dollars of revenue just connecting the two damn things that should have already been connected because they should have an open standard, but don't. And yes, it's not always that case in the software world as well. You know, Oracle's fought against everything, but they've had people trapped in their sphere of influence for years and years. But you start to see, like, IBM is very, very heavy on open source because, you know, they bought Red Hat and they do a bunch of this other stuff. You're starting to see more and more of that. You know, AWS, a bunch of their stuff is open source. But in the data space, nobody's requiring it, so why would you do it? Why would you as a vendor go and really work hard on complying with an open standard if it's not going to result in lots of business? I have a theory. I think that before, the code was considered a competitive advantage for organizations, maybe like 20 years ago, the code of the services.
00:41:58
Speaker
And it turned into legacy really quick. But when we're talking about open source code, because it's maintained by the community, it rarely turns into legacy. You know, people are paying with their time to make it, to publish it, and to keep updating and editing things, and so on and so forth. So you have to invest so much in your internal code to maintain it, to make sure it's not outdated, to make sure it's keeping up with all the versions, keeping up with use cases in the organization. And then there is open source. It started to make sense to use the open source versions of the code. And all of a sudden, 95% of organization code right now is open source, because you don't need to maintain it as you used to. But right now,
00:42:50
Speaker
everything is shifting to the data that this open source actually outputs for the organizations. And I think that this is about the level of maturity of the data industry overall. We might be in the same position in five years or so, where we have everything open source. You know? But we're not there because of maturity, I guess, as a whole industry, because a data pipeline could be an Excel file that is sent via email on a Friday night. It still could be a pipeline. Okay.
00:43:31
Speaker
Well, and I think also one point that you kind of touched on in there that's important, that I started to think about, is you don't open source until your code's not sloppy, and oh baby, data code is sloppy. There is so much one-off. There is so little CI/CD, there is so much... And, you know, JGP, you were talking about some data processing tools, you know, Airflow, Spark, Kafka, dbt. Those to me aren't data tools. They are, like, data transformation tools, and a lot of those are open source, Snowflake being one that's not. But everything else, all the metadata stuff, every other thing that you're spending your time in is non-open source. And so, but again, it's because users aren't demanding it versus...
00:44:16
Speaker
You know, we're seeing that there is, um... So I think really, you made a really good point in there of, like, are we even ready as an industry in data to open source anything? Because you don't want super, super sloppy code to be open sourced. I want to say something. When you say demanding, in your case, it means not choosing. So it's purchasing. Yeah. Yeah, it's refusing to use something if it's not open source. Like, if we're going to reach the point where users are just going to refuse to use any solution unless there is an open source version of it, I'm not sure, because people are desperate today to make things happen with the data. And a lot of that is because of the pressure of AI and the rest. And I think we need to hand it over to JGP because, like, it's us talking.
00:45:09
Speaker
And it's fine if you want to. It is our show. We could do whatever we want. Exactly. It's your show. And I'm a guest here, so I just shut up. No, but, yeah, I think we need to... That's why I'm going back to something I said just a little bit before.

Governance and Open Standards

00:45:30
Speaker
I think open source in software is a business model. And the business model had to evolve. And I think the case, for example, of Elastic is interesting. Okay? So they really embraced the idea of open source. And I remember, like, in mid-2010, we used that. And I think we started with Elastic 0.6 in one of my previous startups.
00:46:03
Speaker
And it was a great tool, then we followed it, etc. But then people like the cloud providers came in and just said, oh well, we can build a lot of great services for free because all the engineering has been done by those companies. And that's where people like Elastic were kind of pissed, when they completely lost control of the revenue-making part of it, because they were competing with someone that did not have the engineering costs to do that. Yeah. So that's a question of ethics. Or lack of ethics. Like, don't get me started on AWS and open source ethics. Oh, gosh. I've got lots.
00:46:42
Speaker
Yeah, I'm not even going there. But the thing is, it's true. There are really a lot of things going on there with open source. But when we're coming to open standards, so when you're thinking about HTML, for example, just imagine the mess, or maybe the web wouldn't be where it is, if the browsers did not support HTML. Okay? And I was there. There were some proprietary extensions, and we see how they actually are not being used anymore. Okay? So I think that the W3C as a standardization body did a pretty awesome job there as well. Okay?
00:47:25
Speaker
We would not be able to use email if the IETF didn't organize this as standards and make open standards with peer reviews as well, and with rules on how to govern those standards. So I think this is also something that is really important to see. A standard is not open because someone put "open" in front of the word, right? You've also got to look at the governance: who owns, in a way, the repositories, who owns the logistics around that. Okay? And I think that's where being with the Linux Foundation also makes a lot of difference. Otherwise, it's just a way of doing proprietary things.
00:48:06
Speaker
And I've got thoughts on the Linux Foundation versus the ASF as well. But I think what you're saying there made me think as well: part of the reason is that data people, you know, and somewhat software people, but especially data people, want to build it themselves, want to do everything from scratch. And so, you know, the people that are building tooling, the vendors, aren't as interested in these open standards because they can do it better themselves, until they actually put, you know, code to IDE, and then they find out that they might have cleaned up six of the problems and then created another 100 when they're in year three. Like, do you think that we're seeing that,
00:48:47
Speaker
because I agree with you that not everything needs to be open source, but that people need to comply with open standards so things can play well together, but we're just not seeing it. Like, do you have a good reason for why we don't see the open standards side, not just open source as a business model, but open standards? Well, I don't want to pick on a major vendor, but we've got to understand what would be their motivation for doing it. Yeah. I've got all my cloud services that are working kind of nicely together. Why would I open the gates to a potential competitor?
00:49:26
Speaker
Yeah, it's the "main pane of glass makes you a major pain in the ass" thing that I talk about. I have a theory on that. So when you mentioned the open standards for sending emails, these kinds of things, I think that innovation happened because of mass adoption of the innovation itself. Like, if you think about the crowd that adopted emails, and compare adopting email to adopting open source data contracts, it's a different size of crowd. Okay? And this is the reason why it's not that widespread as well. And it will take longer to take off, you know.
00:50:16
Speaker
Yeah, you know, I would love a faster adoption of data contracts and open standard data contracts, but it takes time. And I don't have the marketing of big companies, which are actually, as we said, not really motivated by that. Yeah. So it's all linked. A little bit, but the growth is actually accelerating. I see that by the number of stars, and that's one way we can measure the popularity. You wanna brag? You wanna brag?
00:50:57
Speaker
She's saying, do you want to say how many stars you got? You want to name-drop your stars? Yeah, thank you. Thank you for the translation. So it's a bit frustrating in a way, because the data contract template we did at PayPal has more stars than the Bitol one. Okay? But anyway, as of today, we were at 263 stars. So it's not huge by any means. It's also a standard. Okay? It's not like, oh, it's Kubernetes, or it's a fun ML library, or the latest generation of RAG. Okay? It's a standard. So I don't have a lot of metrics to compare it to. But the thing is, I see the effort, the energy being put in. And you're right. Okay? The thing is, it's also a little bit slower to deploy. We're working on version 3.
00:51:42
Speaker
Today we made a major step towards version 3. And it's a democratic process. Okay? And we know that the democratic process takes more time. Yeah. And I hope we will remember that in November. We want simple answers. We don't want complicated. We don't want "it depends." We want simple answers. That's us as humans. And that's why we keep having these same stupid problems across the world. More than that.
00:52:18
Speaker
When you mentioned that you invited three different cohorts of people to kind of observe this beautiful thing, like vendors, users, and I don't remember who was the third one, it reminded me of the three branches of... Government? Yeah. Yeah, the government power, like legislative, or... Yeah, executive and judicial. Yeah. Thanks. Well, I didn't make the link when I decided that. And to be honest, there's a fourth one which is also knocking at the door. It's academia. Okay? So I've got some interest from, once more, people in Europe in joining that. And I think that would be a great addition as well.
00:53:07
Speaker
Yeah. The other thing with academia, I worked with them before, they don't have the same pace. They produce great results at a slower pace. But it could be interesting to bring them in as well. But as I said, the thing is, it's a democratic process, it involves more people, so it's easier to run a dictatorship. A dictatorship? Are you picking on me for that unsuccessful article where I said the, uh, tyranny of... how did I put that?
00:53:42
Speaker
I mean, do you remember I told you the story where, like, yeah, full disclosure, I wrote an article about one of the biggest brands in Europe and how they built their data platform. And I named it the dictatorship of X company, building their data platform. And they were... Do you remember I shared that with you? Like, yeah, basically, are you referring to this case when you're... No, no, I'm not referring to you or to your very personal story.
00:54:14
Speaker
I'm more thinking about a little bit of competition in the data contract space, okay, where it's easier to say, OK, I'm the leader in data contracts, or I'm doing that. But the thing is, you have absolutely no control over what's going on inside. Okay? But as part of the TSC, so the Technical Steering Committee of Bitol, we have the creator of the original data contract. Okay? So Andrew Jones, who you invited on your show, is part of that. So I guess that gives us a little bit of credibility as well.
00:54:53
Speaker
Just a little. Yeah, of course. Listen, JGP, it was a pleasure talking to you and, you know, I don't know how many times we have been talking about this, but it's always such a pleasure to learn something from you and how beautifully you see the world. And, you know, even that open data contract standard being democratic, I love the story. Yeah. Where can people follow up with you? Well, I think the best way is on LinkedIn, to connect with me personally. There's only one Jean-Georges Perrin there, so it's pretty easy, but the URL is slash jgperrin.
00:55:39
Speaker
And go visit bitol.io, okay? There are a lot of instructions on how you can get involved. So bitol.io, that's the website where we're working on that. And we will soon announce a roadmap as well, because as I said from the beginning, the data contract is not our only standard, okay? And we have a full roadmap of two to five standards that are going to pop up between now and the middle of next year. And Bitol is B-I-T-O-L. That is correct. I will make sure to include it in the show notes of the recording. Yeah, in the show notes. But JGP, such a beautiful story. Thank you so much for sharing and onboarding us. Thanks for having me again. Thank you. Bye, guys. Bye.