Are querying systems fundamentally the same?
00:00:00
Speaker
I have a question about questions for you. Is an SQL query against a database the same thing as a REST query or a GraphQL query against a web server? I'd say yes, fundamentally. I mean the details are different, but at the core, they're both ways of querying a data source for interesting data.
00:00:21
Speaker
Okay, what about searching the file system for all the PDF files under a given directory? Still a query against a data source. Okay, what about code linting? Searching through source code, filtering for bad programming patterns. That one's a little harder to see, I think, but source code is structured data. So your source code is a data source and data is something we can ask questions about.
00:00:50
Speaker
Yeah, I would say a linter is a specialized query engine with a collection of prerolled queries that tell you where the bad code is. OK, one more, and this is really a question about what kinds of question we can ask.
Complexities of querying different systems
00:01:05
Speaker
Could you query GitHub to get all your repos, query all those repos for the Dockerfiles, and then query the Dockerfiles for which version of Python they install, and join that to python.org's official list of supported versions to find the unsupported Python repos in your organization?
00:01:29
Speaker
In theory, yes, that is just another query over a familiar set of data sources. In practice, you can't really do it because they're all different query systems tightly coupled to the different kinds of data they're querying. Could you decouple it?
00:01:47
Speaker
Could you build a universal query engine that isn't tied to the database, or the web server, or the file system, or the programming language?
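To make the shape of that cross-source question concrete, here is a minimal sketch in Python. The repo names, Dockerfile contents and supported-version set are all made-up stand-ins for the three data sources, and the version-matching rule is an assumption for illustration only.

```python
import re

# Hypothetical stand-ins for three separate data sources.
REPOS = ["web-app", "etl-jobs", "ml-service"]  # e.g. from a code-hosting API
DOCKERFILES = {                                # e.g. from each repo's files
    "web-app": "FROM python:3.12-slim\nCOPY . /app\n",
    "etl-jobs": "FROM python:3.7\nRUN pip install pandas\n",
    "ml-service": "FROM ubuntu:22.04\n",
}
SUPPORTED = {"3.11", "3.12"}  # illustrative only, not the real support list


def python_version(dockerfile):
    """Return the major.minor Python version of the base image, if any."""
    match = re.search(r"^FROM python:(\d+\.\d+)", dockerfile, re.MULTILINE)
    return match.group(1) if match else None


def unsupported_python_repos(repos, dockerfiles, supported):
    """Join repos -> Dockerfiles -> supported versions, lazily."""
    for repo in repos:
        version = python_version(dockerfiles.get(repo, ""))
        if version is not None and version not in supported:
            yield repo, version


result = list(unsupported_python_repos(REPOS, DOCKERFILES, SUPPORTED))
```

Written by hand like this, the join logic is tangled up with how each source is fetched and parsed, which is the coupling the question is about.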
Introducing Predrag Gruevski and his work
00:01:57
Speaker
I'm joined this week by Predrag Gruevski, who's convinced that we can and should join literally any data source. And if we do, some very interesting questions go from theoretical to trivial.
00:02:11
Speaker
He's the author of Trustfall. It's a querying system that lets you teach it how to mix in new data sources and then treat them all as just one thing. He's also the author of cargo-semver-checks, which can tell you how much your Rust APIs have changed recently by treating the old and new source code as two different data sources that it then just queries and joins.
00:02:37
Speaker
How does that work? What kinds of queries could we make with a system like that? And what kind of expressive power do you get when there's one query engine to rule them all? Let's find out. Let's find out by running the one kind of query Predrag still can't automate, me querying his brain. I'm your host, Kris Jenkins. This is Developer Voices, and today's voice is Predrag Gruevski.
00:03:13
Speaker
I'm joined today by Predrag Gruevski. Predrag, how are you?
Casual chat: Boston weather and travel
00:03:17
Speaker
Doing great. Thanks for having me. Pleasure. You're coming in from Boston, so presumably it's as perishingly cold there as it is here. It is quite cold, and just a couple of days ago I flew back from Florida, so it'll take some extra adjusting now to the more darkness and less warmth. Yeah. If there's one thing I know about Florida, it's where people retire to in the desperate attempt to find warmth year-round.
00:03:42
Speaker
Yes, and I've forgotten the need to moisturize because Florida is quite humid. And then, of course, I come back to Boston so my skin turns into cardboard. Oh, God. Right. Skincare tips live, global skincare tips live on Developer Voices. But I mean, here's a link, right? You've flown back in from Florida: flights as a source of data.
What is Trustfall? A unified querying library
00:04:08
Speaker
Let's link in that way, because you've been working on a Rust library for, is it fair to say, querying all the things? Yeah, that's a very good way to summarize it. Okay. Nobody ever accused me of not being ambitious enough.
00:04:25
Speaker
Give me the details on this. What do you actually want to query in practice, and how? Everything that people might find useful. So databases, APIs, file formats, data that is computed on demand, you know, everything from ML models to, I just need to do some fancy static analysis for a computer program and I need to decide something to do with the results.
00:04:48
Speaker
All of these things are ostensibly things one could represent in a database, but putting them into something like SQL is not necessarily the most convenient, right? If you imagine an API like GitHub, GitHub isn't going to let you do something like select star from GitHub users, right? No.
00:05:06
Speaker
So a traditional SQL database isn't a great fit for all of these kinds of use cases, but that doesn't mean that we can't think of these things as if they were databases. We could ask, what are all the repositories for the logged-in user, for example? Or we could ask, you know, how many stars have the people who have contributed to a particular repository made on other projects? Right? All of these are well-formed queries over the GitHub API.
00:05:32
Speaker
Yeah. We could make similar well-formed queries over, like you mentioned, flight data, right? Whether flights are late, or, you know, which airlines fly to which places and what kind of aircraft they fly and how often flights are delayed and for what reasons, and so on. And again, if we had a database with all this data, something like SQL, we should just use that, but most of the time we don't.
Can a universal query engine minimize work?
00:05:53
Speaker
This is data that is coming live from some API, maybe in some complicated file format, and some additional magic needs to happen.
00:06:01
Speaker
This is the thing, the additional magic, because I've definitely worked on projects where I've got two APIs and I pull them into a database so that I can join them. Yes. And every REST API is a query, right? So is the additional magic the idea that you can then join these completely different data sources? Yes. And more importantly, you can join them as if, you know, from your perspective, they were all in the same place.
00:06:30
Speaker
So a lot of the time when you have two or more data sources, people do what's called ETL, right, Extract, Transform, Load, where you just grab all of the data in bulk from all of the data sources, and you shove them into one place, whether it's something like Snowflake or Postgres or some other kind of data source. And now you say, okay, well, I've reduced the many data sources into one data source, now I'm just going to write SQL queries.
00:06:55
Speaker
Yeah. This is not great if the APIs are charging you by the hit and you're not going to query literally all of the data, right? Or if getting some of that data is computationally expensive. There might be any number of reasons why this is undesirable.
How Trustfall simplifies data access
00:07:11
Speaker
With the query engine that I've built, called Trustfall, you get to pretend as if everything is all in the same place. And under the hood, Trustfall figures out how to get the minimum necessary pieces of data from all of the different places to satisfy your query. So it will do no more than the absolute minimum required, literally meaning that even, you know, if you pull in five rows, then no work has been done on the sixth row yet.
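The "no work on the sixth row" property is ordinary lazy evaluation. A minimal sketch of the idea using a Python generator follows; the `expensive_rows` source and the `fetched` log are hypothetical and only illustrate the behaviour, not how Trustfall itself is implemented.

```python
from itertools import islice

fetched = []  # records which rows actually triggered work


def expensive_rows():
    """Pretend each row requires a costly upstream call."""
    row_id = 0
    while True:
        row_id += 1
        fetched.append(row_id)  # the work happens only when a row is requested
        yield {"id": row_id}


# Pull exactly five rows from an unbounded source.
first_five = list(islice(expensive_rows(), 5))
```

After this runs, `fetched` holds only the five rows that were asked for; nothing was done for row six.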
00:07:38
Speaker
There's a lot of ways you have to tackle this, but the first one that comes to mind is minimum work. It has a few different definitions, right? So you mentioned cost as one, obviously time to execute is another, amount of data transferred over the network is another. What things are you minimizing? What are you ignoring?
00:07:57
Speaker
Great question. It really depends on your use case and your constraints. In some cases, it might be cheaper to fetch more stuff if it produces better performance and you don't really care about, for example, bandwidth costs. In other cases, you want to not hit some expensive API that does some complicated, you know, AI-type transformations more than the barest minimum, because there's a hard rate limit and you're going to get cut off, or you're going to get charged a lot of money. And so there, you're going to prioritize away from, you know, maybe performance and towards not wasting money. All of these things are situational, and they're decisions to be made when plugging in that data source.
00:08:40
Speaker
But importantly, these are decisions that you make when plugging in the data source, not on every single query, right? So in a sense, you define a policy for how you want to access that data. That policy is expressed through code, through some pieces of implementation.
00:08:55
Speaker
But the people that are writing queries over those data sources, they don't care. That policy is applied automatically. All the optimizations that are in place, all the design decisions that were made are there, and they're opaque. So the folks writing the queries don't have to worry about query performance or cost or anything like that. They just write queries.
00:09:13
Speaker
Okay. I realize we're going to have to dive into the implementation enough that we can pick apart the design choices. So Trustfall is the library, the tool. Is it a library? Is it a tool? Is it a framework? Yes. Okay. So in a sense, it's how you'd build a database if it could make no assumptions about the underlying storage.
00:09:37
Speaker
Right, so when you look at SQLite or Postgres, they have a query engine and they have a storage engine under the hood, and those two are quite coupled. If you sever the database in half so that there's no storage, it's just a query execution layer. At that point, it's a library. It's also a framework because somebody needs to plug in data sources, so there's an API to be satisfied there.
00:10:03
Speaker
But to users, it's a query language with which they write queries and they get results back. So if you put that across a network, then to them it's a database server that lives somewhere. If you embed it into their web browser, then it's just another piece of WebAssembly that runs in their browser. So it could be consumed in a variety of different ways depending on the situational needs.
00:10:25
Speaker
Okay, but it is essentially a data-source-agnostic query engine, that's what you're building.
Advanced features of Trustfall's syntax
00:10:35
Speaker
Okay, so let's unpack that. First, how do you query it? It's not SQL, is it? No, it's not. It's not necessarily prohibiting SQL, it's just not what I've implemented currently.
00:10:47
Speaker
So like many query systems, like many, you know, programming environments, it has an internal intermediate representation that all queries get translated to. I've designed a minimal query language on top of GraphQL syntax just to reuse what the community has built, because the tooling in the GraphQL ecosystem is fantastic.
00:11:07
Speaker
So I reuse the schema and the query syntax and a lot of the capabilities of GraphQL just to get, you know, nice autocomplete and visualizations and things like that. Right, yeah. But the semantics are quite different from GraphQL's. So Trustfall is not a GraphQL engine.
00:11:23
Speaker
That's because it allows things that GraphQL does not by itself support. It allows aggregations and left joins and recursion and self-referential queries and all sorts of more advanced query operations than what GraphQL would get you.
00:11:39
Speaker
Nothing prevents anyone from building a SQL layer for this, right, so long as you can compile it down into the same intermediate representation that the Trustfall engine can then execute. So it's kind of a similar idea to LLVM. Yeah. Right? We've built the expensive optimizer bits once, and then we can plug in C and Rust and C++ and Swift and all of these other things, and everyone gets the benefits.
00:12:03
Speaker
OK, but the library is written in Rust, yes? It is, yes. So what would I do if I decided I wanted to use Swift? The analogy is more about the query languages compiling to a shared intermediate representation. OK, I see what you mean. Swift turns into LLVM just like Rust turns into LLVM.
00:12:24
Speaker
Okay, so I could potentially write... what's a graph one? GQL, as a graph query language. I could write something that compiles that syntax to your IR.
00:12:35
Speaker
And then we've got a new query language. Absolutely. OK. But it is actually possible to query Trustfall from other programming languages as well, because Rust is designed to be highly embeddable. So you can query Trustfall from Python. You can query it from WebAssembly, with JavaScript, you know, with other technologies. So you're not just locked into Rust. OK.
Executing queries without specific storage
00:12:57
Speaker
OK. Right. So then if that sort of takes care of the syntax,
00:13:04
Speaker
We might dive back into that, but if that kind of takes care of the syntax: you've got a query engine built. How do we get into the storage layer? Like, which storage layers do you support? How do we extend it?
00:13:22
Speaker
Yeah. The interface that Trustfall has towards storage is one Rust trait, as it's called, essentially an interface for folks who might not have programmed in Rust, that asks the implementer to build four functions to provide to Trustfall. In Trustfall, all queries are queries over graphs. So we consider everything as a graph.
00:13:45
Speaker
This is not any different from the relational databases', you know, representation of data. It's just that instead of rows, we call them vertices. Instead of foreign key relationships, we call them edges, right? So there's a mapping there. Graphs make some things a little bit more elegant, so we talk about graphs.
00:14:05
Speaker
OK. In our graphs, every vertex has a type. So you might say, for example, in the GitHub example, some things are repositories, and some things are user profiles, and some things are commits. A commit has certain properties, like when it was made, you know, who the author was, like what is the author name, what are the contents. And a repository has some other properties, like what organization or user account it is part of, or, you know, what language it's written in, and so on and so on.
00:14:33
Speaker
So every type of vertex has a set of properties and a set of edges that are defined by a schema, and then they have interrelationships. So you might say, you know, the repository Trustfall was created by an author named Predrag Gruevski, and there's an edge to, you know, Predrag's GitHub account profile. Okay, yeah.
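The vertex-type, property and edge vocabulary above can be sketched as plain data. This is a toy model in Python, not Trustfall's schema format (which reuses GraphQL schema syntax); the type, property and edge names are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class VertexType:
    name: str
    properties: list                           # scalar fields on the vertex
    edges: dict = field(default_factory=dict)  # edge name -> target vertex type


# Hypothetical slice of a GitHub-like schema.
SCHEMA = {
    "Repository": VertexType(
        "Repository", ["name", "language"],
        {"created_by": "User", "commits": "Commit"},
    ),
    "User": VertexType("User", ["login", "created_at"], {"repositories": "Repository"}),
    "Commit": VertexType("Commit", ["message", "authored_at", "author_name"]),
}


def check_schema(schema):
    """Every edge must point at a vertex type that the schema defines."""
    return all(
        target in schema
        for vertex_type in schema.values()
        for target in vertex_type.edges.values()
    )
```

The mapping to relational terms is direct: a vertex type plays the role of a table, a property the role of a column, and an edge the role of a foreign key relationship.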
00:14:57
Speaker
The interface to the underlying data representation, which in this case is the GitHub API, is those four functions which do four relatively straightforward things. The first thing is for some given set of vertices, resolve some property. So let's say that I have a bunch of GitHub user accounts and I want to get their creation date.
00:15:19
Speaker
That is a property on a GitHub user account and I can resolve it. And Trustfall says, some query needs this piece of information.
Conceptualizing data as graphs in Trustfall
00:15:26
Speaker
I have a bunch of user vertices. Please tell me how to find this data. And that's going to essentially hand you the JSON and you pull out the field for it. Right, okay, exactly. Similarly, this is for properties, but I mentioned there are also edges.
00:15:42
Speaker
There might be an edge like, you know, for this user, tell me all the repositories they own. So again, there's a second function for resolving edges for a set of vertices, right? So here are some users, please give me the repositories they own. From the, you know, storage implementation perspective, you get whatever the payload is that corresponds to users, and you just pull out the appropriate bits of the JSON. This might involve, you know, making some additional requests upstream to the GitHub API, or whatever the implementation is; you might have some caching or something like that.
00:16:15
Speaker
But this is like, some queries are going to return me a user that has a list of all their repos, but some other queries will return a user that has an ID, which I then go off and get the repos from. Exactly, because the repos endpoint might be paginated, because some users might have thousands of repos, and so GitHub is not going to return all of them. Right. Okay.
00:16:39
Speaker
So vertices, and two different kinds of edges in a way, where properties are a sort of special case of edges. Yeah. Right, exactly. Properties, edges. Getting vertices is obviously one, right? So querying has to start from somewhere, right? So, get the currently logged-in user, for example. Or look up a repository by name.
00:17:05
Speaker
Easy: get some vertices to start querying with. And then the final one is, we actually have a full type hierarchy between vertices. So on GitHub, you might have noticed how, you know, when you write something like #741 in a comment, it can link to an issue or a pull request or a discussion or something else. Yeah. Right? So there's some shared concept there of, like, a numberable item.
00:17:32
Speaker
And that is a vertex in itself, because when it gets mentioned, you know, in a comment, you don't necessarily know what it resolves to. But it could be an issue, or it could be a pull request, or it could be a discussion, or it could be something else. And so an issue is a subtype of this numberable item, and so is a pull request, and so is a discussion entry. So the fourth function is just, I have some vertices that are of one kind, and I would like to do essentially an is-instance check for some subclass.
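The four functions described above can be sketched as a toy in-memory adapter. The method names follow what I understand the Trustfall adapter methods to be called, but the signatures here are simplified and hypothetical, and all of the data is made up; treat this as a sketch of the shape of the interface, not the real API.

```python
# Toy in-memory "GitHub" data; every record and field here is invented.
USERS = {"alice": {"kind": "User", "login": "alice",
                   "created_at": "2015-03-01", "repos": ["demo"]}}
REPOS = {"demo": {"kind": "Repository", "name": "demo"}}
ITEMS = {741: {"kind": "PullRequest", "number": 741},
         742: {"kind": "Issue", "number": 742}}
SUBTYPES = {"NumberableItem": {"Issue", "PullRequest", "Discussion"}}


class ToyAdapter:
    def resolve_starting_vertices(self, entry_point, params):
        """1. Where a query starts: look up the first vertices."""
        if entry_point == "User":
            yield USERS[params["login"]]
        elif entry_point == "NumberableItem":
            yield ITEMS[params["number"]]

    def resolve_property(self, vertices, property_name):
        """2. For each vertex, produce the value of one property."""
        for vertex in vertices:
            yield vertex, vertex.get(property_name)

    def resolve_neighbors(self, vertices, edge_name):
        """3. For each vertex, produce the vertices across one edge."""
        for vertex in vertices:
            if edge_name == "repositories":
                yield vertex, (REPOS[name] for name in vertex["repos"])
            else:
                yield vertex, iter(())

    def resolve_coercion(self, vertices, from_type, to_type):
        """4. For each vertex, is it actually an instance of a subtype?"""
        for vertex in vertices:
            allowed = to_type in SUBTYPES.get(from_type, set())
            yield vertex, allowed and vertex["kind"] == to_type


adapter = ToyAdapter()
users = list(adapter.resolve_starting_vertices("User", {"login": "alice"}))
created = [value for _, value in adapter.resolve_property(users, "created_at")]
repo_names = [repo["name"]
              for _, repos in adapter.resolve_neighbors(users, "repositories")
              for repo in repos]
items = list(adapter.resolve_starting_vertices("NumberableItem", {"number": 741}))
is_pr = [ok for _, ok in adapter.resolve_coercion(items, "NumberableItem", "PullRequest")]
```

Every method takes and returns iterators, so a real implementation could page through an upstream API on demand instead of loading everything first.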
00:18:01
Speaker
Right? I want to check if they are a more derived form of the thing that I believed I had. Right, yeah. That instantly is throwing up a problem in my mind where, like, if I've got this hash reference, do I then need to go to four different GitHub endpoints to resolve it to figure out which of the four different kinds of things it could be?
00:18:24
Speaker
It depends on how the upstream API is designed. So in GitHub's case, I believe the answer is no. Secretly, on GitHub, all of these things I mentioned are actually issues. And the pull request is a type of issue. So if you want to query the issues, then you hit the issue endpoint with a pull request number and it gives you only the issue portion of the pull request.
00:18:47
Speaker
But this is the kind of annoying stuff that Trustfall is designed to shield you from. So if you have a GitHub integration, then you get to write queries being blissfully ignorant of the nasty schema hacks that have gone under the hood just to make all of this work and presumably get it implemented as an MVP. And then, oops, it caught on. Now it's too late to change the API. Right.
Using Trustfall for Rust API checks
00:19:09
Speaker
Yeah. Right. And it's somebody else's responsibility under the hood to make sure that all of this is properly implemented and properly optimized and works well.
00:19:18
Speaker
And when you say someone else, I'm sort of wearing two hats here. There's me writing my GraphQL-ish thing as a query author, and I don't have to worry about that, but when I go to make my GitLab plugin for Trustfall, then I do have to worry about it. Yes, exactly. And like you mentioned, it could be one person wearing multiple hats, but it could also be multiple different people. And I think that's an underrated part of Trustfall. I think that is valuable in ways that are surprising to people.
00:19:48
Speaker
Give me an example. Surprise me. Come on. Yeah. So, cargo-semver-checks, another tool that I maintain. It's a semantic versioning linter for Rust. Okay. It's not the first attempt at doing this. It's been attempted several times in the past. And every time in the past, people have run into issues. The issues are there are way too many rules. The rules are very complicated. They require substantial static analysis to get correct answers. And obviously, people don't like a tool that gives them false positives. So they're upset if you don't get the answer right. And the underlying data sources that you use change more or less with every Rust release.
00:20:29
Speaker
Yeah, just to make sure we're on the same page: this is a tool that's going to say, if I bump the minor version of my library, but I've actually made a major API change to one of the function calls, this will shout at me. Exactly. Yeah. It will say, here's the thing that changed. It is broken in this way. You should bump major or not do this.
00:20:49
Speaker
That does feel like something that you would have to implement as part of Rust to get access to all the information. Right. And Rust will happily give you access to the underlying, you know, compiler APIs or what's called rustdoc JSON, which is a machine-readable description of the API, and things like that. But all these things are unstable, right? It doesn't make sense to lock them in when Rust is a rapidly evolving language that is getting better every day. Locking them in would just tie the hands of the maintainers that are working on all of this, so it doesn't really make sense to say, this is the format and we're never changing this.
00:21:27
Speaker
On the flip side, that means that for whoever is writing these kinds of checks, it would be a massively unhappy experience if they had to maintain literally hundreds of these checks and then rewrite them once per Rust release, because that's just not going to work, is it? No. I can see that being abysmally painful. How does Trustfall solve it?
00:21:47
Speaker
Yes, Trustfall solves it. So, cargo-semver-checks is a project that I've been working on for a little over two years, about two and a half years now. And it's the recipient of a Rust Foundation grant. There are plans to merge it into Cargo. It's used by top tech companies worldwide, you know, companies like Amazon and Google and Microsoft and Cloudflare and so on, right? So it's a serious piece of software, right? It's not just a toy.
00:22:16
Speaker
The way that it works is that we have different adapters, different plugins for the different kinds of data formats, that all translate everything into the same consistent Trustfall schema. Because at a Trustfall level, Rust is a language that has functions that take parameters and structs that have fields and enums that have variants. And how all of that data is represented under the hood in these compiler output formats, it doesn't really matter, right?
00:22:48
Speaker
Rust isn't going to stop having functions tomorrow, right? That's not the kind of breaking change that's going to happen. Yeah. The kind of breaking change that might happen is, hey, this field over here in this JSON object used to be a string, but we realized we need a few extra pieces of information. So now it's an object with some nested keys. And the previous piece of information that we had is now one of the keys in the object as opposed to being the top-level thing.
00:23:14
Speaker
Right, so are you saying you end up with a few core data types that describe Rust APIs in general, and then you have a different adapter for every breaking change on the Rust API?
Adapting to Rust API changes with Trustfall
00:23:28
Speaker
Exactly. And so this means that the queries that we write, which look for breakage, things like, hey, a function used to be there, and it's no longer there. Or a function used to take three arguments, and now it takes four.
00:23:44
Speaker
Those are written against the abstract schema that describes Rust, that is, the Trustfall schema. And how those queries get evaluated is not the tool's problem.
00:23:56
Speaker
OK. Right. So it can get any number of different formats. At one point, I think we supported nine different formats that are all mutually incompatible. And we just asked Rust to give us a, you know, machine-readable representation of the API. It came back in one of those nine formats, depending on which Rust version you used. We used the appropriate adapter to read that format, and then ran hundreds of queries to find all the different kinds of breaking changes that the tool can find.
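The adapter-per-format idea can be sketched in a few lines of Python. The two document layouts below are invented to mirror the "string became an object with nested keys" example; they are not the real rustdoc JSON format, and `format_version`, `items` and `kind` are hypothetical field names.

```python
def name_from_v1(item):
    # Hypothetical older format: the name is a plain string.
    return item["name"]


def name_from_v2(item):
    # Hypothetical newer format: the same data moved under a nested key.
    return item["name"]["value"]


# One reader per incompatible format, chosen by the version in the document.
READERS = {1: name_from_v1, 2: name_from_v2}


def function_names(doc):
    """Normalize either format into the same stable shape: a set of function names."""
    read_name = READERS[doc["format_version"]]
    return {read_name(item) for item in doc["items"] if item["kind"] == "function"}


old_doc = {"format_version": 1,
           "items": [{"kind": "function", "name": "parse"}]}
new_doc = {"format_version": 2,
           "items": [{"kind": "function", "name": {"value": "parse", "span": [3, 9]}}]}
```

Anything written against `function_names` keeps working when a third format appears; only the `READERS` table grows.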
00:24:25
Speaker
Give me an example of a query, because I can't quite visualize a GraphQL query that tells me there's a minor breaking change. Yes, and this is where Trustfall is not like GraphQL, right? It's more expressive. Okay. So the simplest thing that you could possibly imagine is some top-level function used to exist and it no longer exists at the same path.
00:24:49
Speaker
So when thinking about breaking changes, a useful mindset to be in is, how would I convince somebody that something is broken? And in this case, the answer is, I will write an import statement and that will cause a compile error in one version of the library and it will work fine in the previous version. Okay. Right, so if I can do that: breaking change. What does the query look like for this? It looks like this. For the crate that we're looking at, look at the old version,
00:25:18
Speaker
So the one that is already published on crates.io that people are already using. Look at all of the public functions that are defined inside and look at all of the paths under which those functions can be imported. Then look at the new version that the user is proposing that they publish. Look at all the public functions and look for one that is at a matching import path. Right. And if you find no such function, you have found a breaking change.
00:25:49
Speaker
Because now, given the import path that I know works and will no longer work, I will just write use, that import path, semicolon. And that is an example of a Rust program that is broken under that delta between the old and new packages. You can tell me if this is my mind or something in the library, but that seems to me to be a very relational-database, set-based operation. You're getting the, what's the correct term, the disjunction of the two sets, right?
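In set terms, the check described here amounts to a set difference: paths importable in the old version minus paths importable in the new one. A minimal sketch follows; the crate and function names are invented, and the real tool works on richer data than plain strings.

```python
def missing_functions(old_paths, new_paths):
    """Import paths of public functions present in the old version but not the new one."""
    return sorted(set(old_paths) - set(new_paths))


# Hypothetical public function paths for two versions of a crate.
old_api = ["mycrate::parse", "mycrate::format", "mycrate::util::helper"]
new_api = ["mycrate::parse", "mycrate::util::helper", "mycrate::render"]

# Each missing path yields a one-line program that compiled before and fails now.
witnesses = [f"use {path};" for path in missing_functions(old_api, new_api)]
```

Adding `mycrate::render` is not flagged, since new items do not break existing imports; only the removal of `mycrate::format` produces a witness.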
00:26:19
Speaker
It doesn't seem like a graphish query. Ah, but it is exactly a graphish query. It's just that graphs are equivalent to relational queries in this case, right? So the way that we're representing it, we're choosing to say that importable path, for example, is an edge on function.
00:26:41
Speaker
Okay. Right. And it lists all of the importable paths with all their properties and stuff like that, right? In a SQL database, we would call that a foreign key relationship, right? It would be a join to some other table. The reason graphs are a little bit more convenient, though, is because importable paths are not something that inherently exists in the data that we are reading.
00:27:03
Speaker
It is something that we materialize on demand. It's almost like it's a view that we've defined that under the hood does static analysis. Right. Yeah. And so we can sort of shoehorn the relational model back into this, but it is a lot more convenient to say, you know, instead of I'm joining to something that doesn't exist,
00:27:28
Speaker
Right? To say, I'm resolving an edge, and there are some implementation details under the hood that give me whatever the importable paths are on the other side. Right? Yeah. And I'm able to make a query that says, give me the unresolvable edges. Is that what's going on? In a sense, yes. The adapter for Trustfall,
00:27:53
Speaker
when asked for importable paths, will compute them dynamically on the basis of doing static analysis over the package, right? So it's not something that we've had to necessarily pre-compute or get directly from Rust, in part because that set actually cannot be computed. It could potentially be infinite in size.
00:28:12
Speaker
Okay. In Rust it's possible for items to be importable under infinitely many paths. And so attempting to materialize an infinitely sized table would be a bit of a problem. Yeah. But if you can prove that you don't need to materialize an infinitely large number of things, then it's fine if that set is infinite so long as we never look at the infinity.
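A small sketch of why an infinite set is harmless when it is produced lazily. The path family below is a made-up model of modules that re-export each other in a cycle, and the fixed `limit` is a simplification for illustration, not the analysis cargo-semver-checks actually performs.

```python
from itertools import count, islice


def importable_paths(name):
    """Yield an unbounded family of paths, modelling a re-export cycle a -> b -> a."""
    for depth in count():
        yield "mycrate::" + "a::b::" * depth + name


def has_path(name, wanted, limit):
    # Only a bounded prefix is inspected, so the infinite tail is never materialized.
    return wanted in islice(importable_paths(name), limit)
```

Calling `has_path("f", "mycrate::a::b::f", 10)` returns after looking at two candidates, even though the generator could go on forever.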
00:28:33
Speaker
Yeah. Yeah. Okay. So I'm still trying to get the query straight in my head, and forgive me for laboring this, but I want to make sure I understand. So I'm writing some kind of GraphQL query that says, look at my Rust program, look at all the importable paths. And I'm trying to, how am I comparing that to the old version of my code? Yes. So that is the principal place where Trustfall differentiates itself from GraphQL.
00:29:03
Speaker
OK. The key operator here that GraphQL is missing is the ability to say this value over here and that value over there are equal, right? They're both values in my query. Neither of them is a variable, right? Because I'm saying, for all the paths of all the public functions, find the matching one, or if you fail, then yell. Yeah.
00:29:28
Speaker
Right, so the path is not a variable, it's something that we look up dynamically. In Trustfall, because it's not just GraphQL, it has a bunch of extensions and the semantics are quite different, we can essentially capture the value that we've encountered in a query.
00:29:48
Speaker
I've defined directives that are custom to Trustfall that allow us to do these kinds of operations. So the directive to capture one of these values is called tag. So we will tag the importable path that we are attempting to find in the new version. We will tag all the importable paths that we can find in the original version of the package, for every public function. So we tag that value, and then we go to the new version and we say, apply a filter, so @filter, with the operator equals, using the tag that we've captured from elsewhere in the query, to look for an importable path that matches the other place. And give you the ones that don't resolve.
00:30:32
Speaker
Yes. And then you say, aggregate, like count up how many matching importable paths there were in the new version of the package. Count how large that aggregate is. Filter with equals zero. Right, yeah. And that means that we have found all the ones that didn't match. Right. Okay. Yeah. I'm with you now. That self-referential nature is key, because we can tag values and then use them in a filter later.
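Put together, the tag, filter, aggregate and count steps might look roughly like the query below, held here in a Python string. The directives (`@tag`, `@filter`, `@fold`, `@transform`, `@output`) are Trustfall's, but the schema names (`CrateDiff`, `baseline`, `current`, `importable_path` and so on) are written from memory of cargo-semver-checks and should be treated as assumptions, not a verbatim copy of a real lint.

```python
import re

# Approximation of a "function missing" query; schema names are assumptions.
FUNCTION_MISSING_QUERY = """
{
    CrateDiff {
        baseline {
            item {
                ... on Function {
                    visibility_limit @filter(op: "=", value: ["$public"])
                    name @output
                    importable_path {
                        path @tag @output
                    }
                }
            }
        }
        current {
            item @fold @transform(op: "count") @filter(op: "=", value: ["$zero"]) {
                ... on Function {
                    visibility_limit @filter(op: "=", value: ["$public"])
                    importable_path {
                        path @filter(op: "=", value: ["%path"])
                    }
                }
            }
        }
    }
}
"""

# Values supplied alongside the query for the $-prefixed variables.
ARGUMENTS = {"public": "public", "zero": 0}


def referenced_arguments(query):
    """Names of the $variables a query expects its caller to supply."""
    return set(re.findall(r"\$(\w+)", query))
```

The `%path` reference is the interesting part: it is not a caller-supplied variable but the value tagged earlier in the same query, which is what lets the new version be compared against the old one.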
00:31:00
Speaker
And then we can perform aggregations and transformations of that output to essentially say, this does not exist. Right. Yeah. Now I start to see how it's functionally equivalent to a relational query. Exactly. Yeah. Okay. And this is not something that GraphQL would natively support. I'm sure people have come up with all sorts of extensions, right? But vanilla GraphQL does not do this. And more importantly, vanilla GraphQL requires that you include all of the results and return them fully nested.
00:31:34
Speaker
That puts us at odds with the thing I said earlier about making queries as cheap as they could possibly be, right? You can't really lazily evaluate a GraphQL query when it wants to return a fully nested result with the entire payload, with every single component that matches. Yeah. Trustfall returns stuff like a database would. It returns a tabular result, which is essentially an iterator of rows.
00:32:00
Speaker
Right? So in our case, we can lazily evaluate it because we will get one row and then we will not get the next one unless the user asks. So we do flattening of the results. We do not fully nest things. If you want them nested, you can apply aggregations and that will re-nest things.
00:32:15
Speaker
By default, though, it's exactly equivalent to relational semantics. So when you traverse an edge in Trustfall, it's exactly the same as writing a join in SQL. When you apply a filter in Trustfall, it's like adding to a where clause. When you add an output directive to a Trustfall query, it's exactly like adding to a select statement, and so on. So it's all one-to-one. It's just a little bit more convenient to talk about edges into things that don't exist until they're computed. You know, they don't exist until they're looked at.
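The one-to-one mapping and the "iterator of rows" idea can be sketched with a plain Python generator. This is only an analogy under assumed data: `REPOS`, `OWNERS`, and `fetch_log` are invented for the example, and nothing here is Trustfall's actual API.

```python
from itertools import islice

# Hypothetical in-memory data standing in for two data sources.
REPOS = [
    {"name": "a", "owner_id": 1},
    {"name": "b", "owner_id": 2},
    {"name": "c", "owner_id": 1},
]
OWNERS = {1: {"login": "alice"}, 2: {"login": "bob"}}

fetch_log = []  # records which repos were actually fetched


def repos():
    """Lazily produce repos, logging each fetch so laziness is observable."""
    for repo in REPOS:
        fetch_log.append(repo["name"])
        yield repo


def query():
    for repo in repos():                   # starting point   ~ FROM repos
        owner = OWNERS[repo["owner_id"]]   # traverse an edge ~ JOIN owners
        if owner["login"] == "alice":      # filter           ~ WHERE
            # output ~ SELECT; each result is a flat row, not a nested payload.
            yield {"repo": repo["name"], "owner": owner["login"]}


# Ask for a single row: only the first repo is ever fetched.
first = list(islice(query(), 1))
```

Because the rows are flat and produced one at a time, asking for one row does one row's worth of work; `fetch_log` contains only the first repo afterwards.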
00:32:44
Speaker
Yeah, yeah. I can also see that something about it seems to fit slightly better in my head when you're resolving edges to completely separate services. Exactly. That feels... Yeah, somehow that feels more natural. Right, because you're not saying, oh, I promise to have a foreign key in this other service that has no idea where I came from.
00:33:04
Speaker
That seems a little bit strange. Yeah, yeah, I can see that. OK, so you're making me think, then: if we can query across the compilation data of a Rust project for different rules, have you got any designs on replacing Clippy, to do a linter for Rust?
Creating a Python linter with Trustfall
00:33:29
Speaker
cargo-semver-checks is essentially a linter for Rust.
00:33:32
Speaker
Replacing Clippy, I don't think so. Clippy is very good at exactly the kinds of things that cargo-semver-checks is not, and vice versa. So cargo-semver-checks does not look at method bodies at all. It only looks at external, API-signature-level things. Doing analysis inside method bodies can be done, but Clippy does it quite well and there's just no point in duplicating effort.
00:34:05
Speaker
If you're interested in linting at a higher level, at the level of, you know, APIs and items in the crate, right? Like, what are the constants that you expose, or something like that? By all means, you could build on top of what Trustfall and cargo-semver-checks allow. And that is entirely plausible. If someone listening to this has a linter they've wanted to build and they haven't been able to, I would love to chat. I might be able to help.
00:34:30
Speaker
I haven't needed something like that myself, just because, with cargo-semver-checks, things that break inside method bodies are hopefully caught by your unit tests, right? cargo-semver-checks aims to catch the things where you've updated the unit test and you haven't realized that there is some fundamental thing that gets broken, even though your test suite is passing.
00:34:52
Speaker
But what I'm wondering, almost more than the competition with Clippy, is: if you're saying I can query all the things, and, let's take Python as our victim, somewhere there's an abstract syntax tree in Python.
00:35:10
Speaker
Trees, graphs: a very natural fit there. Could I start to write a linter for Python in Trustfall, or for any language? If I can query all the things, can I query for bad patterns in language X? Yes, absolutely. And in fact, I've built a prototype semver linter for Python.
00:35:31
Speaker
Which is a lot of fun, to be honest. Finding breaking changes in a dynamic language like Python is obviously trickier than doing it in Rust, where the compiler will very happily tell you everything you need to know about the program. But this is one of the advantages of the Trustfall model.
00:35:50
Speaker
It requires some expertise to, like you said, do the program analysis and read the abstract syntax tree and decide you know what import paths are available for this item and stuff like that. And that is a box of expertise that certain people have. And then there's a box of expertise of, here are the bad patterns that I would like to catch. Right?
00:36:15
Speaker
There are some people that have both sets of expertise at once. There are a lot more people who have one or the other, but not necessarily both. With Trustfall, you can have the people that have the expertise for language analysis write a plugin for a language, and then say, hey, here's a Trustfall schema. You can write queries over this, and you can build whatever lints you want. And then somebody else can come in and say, oh, here's this bad pattern that caused an outage or broke my code or something like that. I want it to never happen again. They can write a query, and they can be blissfully ignorant of all of the magic static analysis that happened under the hood to make their query run.
00:36:53
Speaker
Right, yeah, yeah, yeah, okay, I can see it. And more importantly, there's actually a third stakeholder here that is not obvious. Optimizing queries is very important if you're dealing with a large code base. The folks that know how to optimize database queries are not necessarily the program analysis people, and they're not necessarily the people that will write the lint. Yeah, yeah, yeah, okay.
00:37:15
Speaker
We've got to get into optimization then, but before we leave this particular topic, I'm just thinking: if I actually wanted to write a linter, I see that I could get Trustfall accessing the programming language as a data source. Maybe your variant of GraphQL isn't the ideal query language for expressing lint rules. Would I then just go and write some Rust that produces the same IR and be happy with that?
00:37:46
Speaker
It might be a bit of work, but in principle, yes, you could definitely do that. Although I would claim that, having written three linters on top of Trustfall already, I think it's pretty decent at expressing lint rules.
00:38:00
Speaker
Okay. You know, being able to reference data in one place and filter on it somewhere else in the same query, plus recursion, plus aggregation, plus transformations, plus the freedom to arrange a schema in a way that is not necessarily tied directly to how the underlying JSON or program analysis works, but to whatever is ergonomic for your queries, is actually shockingly powerful.
00:38:27
Speaker
Because I've definitely worked with third-party APIs where I'm thinking, what the heck were they thinking about this schema? Being able to abstract that away definitely is a feature I'd want. Right, and that's really the key here. So, an example from Python: we can do static analysis to determine which items are importable at which paths, and that works quite well most of the time. But some of the time people do some fancy metaprogramming and it's not necessarily obvious what is importable from which module.
00:38:56
Speaker
Well, in an ideal world, we would just be able to say, hey, for any given Python module, what are the items inside? What are the functions and classes and things that it defines? And as it turns out, in the worst case, we could always just define a fallback that fires up Python, runs import on that module, runs dir on that module, sees what comes out, and then says, here you go.
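The static-analysis path and the "import it and call dir" fallback could look roughly like the sketch below. The helper names `static_items` and `runtime_items` are invented for illustration; only the standard-library `ast` and `importlib` calls are real. Note that the fallback actually executes the module's code, which is part of why it is a hack.

```python
import ast
import importlib
from typing import List


def static_items(source: str) -> List[str]:
    """Static analysis: top-level functions and classes defined in the source text."""
    tree = ast.parse(source)
    names = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            names.append(node.name)
    return sorted(names)


def runtime_items(module_name: str) -> List[str]:
    """Fallback: import the module for real and ask it what it contains.

    This runs the module's import-time code, so it also sees names created
    by metaprogramming that static analysis may miss.
    """
    module = importlib.import_module(module_name)
    return sorted(name for name in dir(module) if not name.startswith("_"))


example_source = "def f():\n    pass\n\nclass C:\n    pass\n\nx = 1\n"
static_result = static_items(example_source)
runtime_result = runtime_items("json")
```

Either function could sit behind the same schema, so a query asking "what items does this module define?" would not need to know which strategy answered it.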
00:39:24
Speaker
It's a hack, but it might be a necessary hack, and the fact is that you'd be able to do it. Right. And so the part that I think gives you a lot of freedom here is that you can model the schema to whatever the use case requires. And it doesn't necessarily have to be the case that there is a one-to-one mapping to the underlying JSON format or REST API or whatever you might be querying, right? So you design the schema that you wish you had, and then you figure out how to plumb in all of the different pieces of data that need to go there. And so long as you can pull that off, everything else is groovy.
00:39:59
Speaker
Okay. You've thrown me now by thinking of Groovy. I don't know anything particularly special we could query about Groovy compared to Python, but yeah. Okay, so you've got data sources like the GitHub API. I know you've got file systems as a data source because that's just another graph or tree you can query, right? Are there any particularly special data sources, or any data sources you don't think would be a good fit for this?
00:40:31
Speaker
I've not figured out how to do a great job of heavily analytical queries over time series data. So if you're, I don't know, a quant trading shop and you have millisecond-level stock price movements and you want to do something complicated over that,
00:40:49
Speaker
Trustfall might not be the best answer there, just because you can express the queries that you'll need, but they'll be a little bit clunky and they'll be a little bit slow. Those kinds of data sources benefit heavily from vectorization and things like that. And it's not necessarily the case that this can't be done in Trustfall, it's just that it's not very good at it right now. It hasn't really been a priority, so I haven't really thought very much about how to do it very well.
00:41:14
Speaker
But I will say I've plugged in some fascinating data sources, including at my previous job.
Corporate use of Trustfall for deployment
00:41:21
Speaker
I gave a talk at a conference a couple of years ago called How to Query Almost Everything. And a couple of the examples there were lints that are able to catch, in a large corporate monorepo,
00:41:36
Speaker
cases where someone accidentally attempts to publish a Python project on a Docker image that is incompatible with the Python version that the project requires. So you have a Python project, it has a pyproject.toml, and that says it requires Python 3.13.
00:41:55
Speaker
That project then has some configuration in some company-specific configuration language that selects a Kubernetes cluster and defines services and endpoints and all of the resources that are necessary. And eventually it picks a Docker image with which to deploy the service. And that Docker image is built from some base image that contains a version of Python.
00:42:21
Speaker
That version of Python should match whatever the version of Python is in pyproject.toml. As it turns out, it often doesn't. And as it turns out, this is the sort of thing that you often find in production at 3am.
00:42:36
Speaker
Right, because Python will happily putter along until the first time it finds some function that was introduced in 3.13 that wasn't there in 3.12. Yeah, which could potentially take weeks to come up, right? Yeah. Right. So you're literally building a join query between the Python definition and the Dockerfile.
00:42:57
Speaker
Yes, and everything in between. So the configuration language, Kubernetes, all of the monorepo layout, because a lot of the stuff is, you know, if I put these magic files in this directory and those magic files in that directory, then those two things are related. And obviously, this is something that is written in some onboarding guide, right? But it's not necessarily top of mind. And certainly, a SQL query engine wouldn't know anything about this.
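As a rough sketch of that cross-system join, the snippet below links a project's declared Python version to the Python version inside the Docker image it deploys with. Everything here is assumed for illustration: the records are hard-coded stand-ins for pyproject.toml, deployment config, and image metadata, and `requires_python` is treated as a minimum version.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical, already-extracted facts from three separate systems.
PROJECTS: List[dict] = [
    {"name": "billing", "requires_python": "3.13", "image": "base-py312"},
    {"name": "search", "requires_python": "3.12", "image": "base-py312"},
]
IMAGES: Dict[str, dict] = {
    "base-py312": {"python": "3.12"},
    "base-py313": {"python": "3.13"},
}


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '3.13' into (3, 13) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def mismatched_deployments(projects: List[dict], images: Dict[str, dict]) -> Iterator[dict]:
    """Yield projects deployed on an image whose Python is older than required."""
    for project in projects:
        image = images[project["image"]]  # the "edge" from project to image
        if version_tuple(image["python"]) < version_tuple(project["requires_python"]):
            yield {
                "project": project["name"],
                "requires": project["requires_python"],
                "image": project["image"],
                "image_python": image["python"],
            }


problems = list(mismatched_deployments(PROJECTS, IMAGES))
```

Neither data source is wrong on its own; the problem row only appears once the two are joined.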
00:43:20
Speaker
Yeah, and it's something you wouldn't think of as being a query, but it clearly can be thought of that way if we had the tools to implement it. Exactly. Another example would be: find Python packages in our internal package registry that ship a py.typed file, so they say, we have type hints here, you can use them, but have a CI configuration that opts out of running mypy. Those type hints are not checked.
00:43:52
Speaker
Okay, yeah. And once again, this is not a Python problem, right? The Python code by itself is fine. And it's not really a package problem once it's uploaded to the package registry. It's that your CI configuration is incompatible with what's inside the package manifest.
00:44:11
Speaker
So no single tool can let you know that this is a problem. Just like in the Python and Docker image divergence problem, Python will be happy and Docker will be happy, but both of them put together will not be happy.
00:44:26
Speaker
Yeah, yeah. And it's that whole integration problem. We don't have a way of enforcing rules across integrations. Exactly. Yeah. And so this is the kind of use case where Trustfall shines, because you build an integration for what's inside a Python package's manifest and you build an integration for what is in the CI configuration. You build an integration for what's inside this Docker image. You build an integration for what's deployed in this Kubernetes cluster. And then you interconnect all of them in all the ways that they know things about each other.
00:44:56
Speaker
Right, Kubernetes knows it's this Docker image and that cluster, or something like that. And all of a sudden, Trustfall gives you the ability to express all of these really complicated things across the entire system as a whole.
00:45:10
Speaker
And that was the gap that the Trustfall system was attempting to fill. The fact is that DevOps or infrastructure people know the ways that people mess up programs. But if you ask them to start with program analysis of Python, nobody's going to be happy with that. It's just going to frustrate a lot of people. And they have 10,000 other things to be doing. And so that work is just not going to happen. And we're going to keep finding these outages in production. And then everyone's going to be miserable. Yeah.
00:45:39
Speaker
Yeah, you're making me think I could do something like: look at this GitHub repo, find all the tags that begin with v, and tell me if they don't also exist as headings in changelog.md.
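A minimal sketch of that tags-versus-changelog check, with both inputs passed in as plain data rather than fetched from GitHub. The helper name and the assumption that a heading's text equals the tag exactly (for example `## v1.0.0`) are choices made for this example only.

```python
from typing import List


def tags_missing_from_changelog(tags: List[str], changelog_text: str) -> List[str]:
    """Return tags starting with 'v' that have no identically named heading."""
    # Collect heading text: lines starting with '#', minus the '#' markers.
    headings = {
        line.lstrip("#").strip()
        for line in changelog_text.splitlines()
        if line.startswith("#")
    }
    return [tag for tag in tags if tag.startswith("v") and tag not in headings]


example_tags = ["v1.0.0", "v1.1.0", "nightly"]
example_changelog = "# Changelog\n\n## v1.0.0\n- first release\n"
missing = tags_missing_from_changelog(example_tags, example_changelog)
```

Here `v1.1.0` is reported because it was tagged but never documented, while `nightly` is ignored because it does not start with `v`.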
00:45:53
Speaker
Or even scarier, you could say, look at all of the releases on a package index, like crates.io, for example. Look up the hash with which that release was published, because that's part of the package metadata that gets shipped to crates.io. And then check the corresponding GitHub repository and see if the tagged version that corresponds to that number matches that commit. Oh yeah. That is a scary query, let me tell you. The result is something you're not going to like. Yeah, I mean, you'd hope it would be the empty set, but I bet it's not. You really would hope, yes. Oh gosh. It is not the empty set, no.
00:46:37
Speaker
Okay, okay, I could start to see how you can have a lot of fun with this. You made me wonder, for time series data: would it be mad, would it be insane, to use an SQL database as my query source for Trustfall? Or some other query language? Absolutely, yes. So Trustfall, just like it's pluggable on the front end, the query languages, you know, if you can compile to IR, you're good, is also pluggable on the back end. And in fact, at my previous job, we built a system for compiling Trustfall IR directly into SQL and other databases' query languages.
00:47:18
Speaker
Okay. The rule of thumb essentially is: Trustfall can do quite a good job of executing queries. But if you have something that has more information about the underlying data source, like a SQL database that has all sorts of statistics and knows where the indexes are and stuff like that, just let that database do its job.
00:47:39
Speaker
Right, and so you can translate the Trustfall IR into SQL, or you could translate it into other query languages for other databases. We used a variety of graph databases, for example, that had different kinds of query languages. So you compile to that and then you ask the database to pretty please run that query.
00:47:57
Speaker
Interestingly, because this is such a powerful query language and people were very happy to write queries, a lot of the time this would lead to situations where they would ask for a lot of data. And what you don't want is to ask a SQL database to give you a billion rows' worth of data all at the same time. So some of the fun stuff that we ended up building was a layer between where people submit queries, essentially, and the underlying database, that would analyze the query and figure out what all the different data sources involved are: say, this SQL cluster over there and that S3 bucket over there, and so on. It would slice it up into the individual components. Then for every individual component, it would analyze the cost and try to figure out if it's going to blow something up. If it's expensive, it will, transparently to the user, slice up that query into smaller pages.
00:48:51
Speaker
And run them individually. So it will run the first page, it will get some results, and it will return upstream a bunch of results and a continuation query. So a Trustfall query that says, essentially, give me the rest of it. Yeah. So you're kind of paging internally. Exactly. So you're paging internally at the level of each individual service. And then at a higher level, the interface where the user submitted the query can glue all of these things together and send them back to the client. Right, yeah.
00:49:21
Speaker
So from the client's perspective, they just ran a billion-row query and everything worked fine. From the underlying system's perspective, it ran a thousand rows' worth of data and returned it to the client, and the client may want to fetch some more rows later. Yeah. Okay. That is the best of all worlds. Yeah. So.
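The "results plus a continuation" idea can be sketched as follows. This is an illustration only: the real system described here returned a continuation query, whereas this sketch uses a simple offset as the continuation, and `run_page`, `run_query`, and `PAGE_SIZE` are invented names.

```python
from itertools import islice
from typing import Iterator, List, Optional, Tuple

PAGE_SIZE = 3      # hypothetical page size chosen by the paging layer
pages_run = []     # records the offset of each page the backend actually ran


def run_page(table: List[int], offset: int, page_size: int) -> Tuple[List[int], Optional[int]]:
    """Backend: run one bounded page and return (rows, continuation)."""
    pages_run.append(offset)
    rows = table[offset:offset + page_size]
    next_offset = offset + page_size
    # The continuation says "give me the rest"; None means nothing is left.
    continuation = next_offset if next_offset < len(table) else None
    return rows, continuation


def run_query(table: List[int], page_size: int = PAGE_SIZE) -> Iterator[int]:
    """Client-facing view: looks like one big result set, fetched page by page."""
    continuation: Optional[int] = 0
    while continuation is not None:
        rows, continuation = run_page(table, continuation, page_size)
        yield from rows


table = list(range(10))
first_four = list(islice(run_query(table), 4))
```

The client asked for four rows of a "big" query, and the backend only ever ran two small pages to serve them.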
00:49:41
Speaker
Let me ask you a question about cost and optimization then, between different data sources. You can easily imagine a query where you're joining the GitHub API to a local SQL database. And in almost all cases, I know that going to the local SQL database, driving the query from that and joining to the remote once I've got the data, is probably going to be cheaper. But do I have a way of expressing: this is a cheap data source, this is an expensive data source?
Optimizing queries with schema control
00:50:11
Speaker
Yes, and it's not the way that you think. Good. So Trustfall in its present open source incarnation is relatively naive as to the ordering of the clauses within a query. I just haven't built an optimizer that reorders things on the basis of cost yet. It hasn't been a priority. But the easiest way to control the cheap versus expensive is with the schema.
00:50:39
Speaker
So Trustfall will run the joins, essentially, in lexical order, in the order that they appear in the query. So just design your schema such that the entry points start in the SQL database and not in the GitHub API. Oh, right. OK. So drive it from query land. But this does move some kind of implementation detail to the query author. Not quite. It moves some of the work to the schema designer.
00:51:10
Speaker
Okay, so it's not the query itself that defines the resolution order, but the schema definition. Well, the schema definition governs which queries are possible to write. And so if we just make the expensive queries not possible, we nudge people to writing the cheaper ones.
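Here is a small sketch of why the entry point matters when joins run in the order they appear. The data and the `fetch_remote` function are hypothetical; the point is only that starting from the cheap local source and filtering there keeps the number of expensive remote calls small.

```python
from typing import Dict, Iterator, List

# Hypothetical cheap source: a local table of repos and owning teams.
LOCAL_DB: List[dict] = [
    {"repo": "a", "team": "infra"},
    {"repo": "b", "team": "web"},
    {"repo": "c", "team": "infra"},
]
# Hypothetical expensive source: stands in for a remote API.
REMOTE_API: Dict[str, dict] = {
    "a": {"stars": 10},
    "b": {"stars": 5},
    "c": {"stars": 7},
}

api_calls = []  # records every expensive remote call


def fetch_remote(repo: str) -> dict:
    api_calls.append(repo)
    return REMOTE_API[repo]


def infra_repo_stars() -> Iterator[dict]:
    # The only entry point is the local table, so every query starts cheap.
    for row in LOCAL_DB:
        if row["team"] != "infra":          # filter before crossing the expensive edge
            continue
        remote = fetch_remote(row["repo"])  # edge into the remote source
        yield {"repo": row["repo"], "stars": remote["stars"]}


results = list(infra_repo_stars())
```

If the schema offered no top-level entry point into the remote source, a query author simply could not write the version that lists everything remotely first.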
00:51:29
Speaker
OK, yeah, because this is relating back to this top-level entry point: you don't give people top-level entry points to GitHub if you want them to go in via SQL. Yes, exactly. OK, now I'm with you. And again, this is a right-now kind of limitation. Down the line, you could imagine all sorts of improvements that make this no longer be the case. Right. It's just that Trustfall is a very bold project that has all sorts of surface area to be expanded in different places. And there are unfortunately only so many hours in the day, so I haven't gotten to it yet. Yeah, yeah. I think all query engines start off with optimization being the user's responsibility and get to being automatically optimized, because the machine can do it better than a human. Absolutely. And Trustfall is actually getting there
00:52:19
Speaker
as well. So I gave a talk, I want to say two years ago, on speeding up Trustfall queries for cargo-semver-checks. The TL;DR of the talk is that adding about 700 lines' worth of code gave a 2,300x-plus speedup. So I turned a workload that took over five hours and made it take about seven seconds. And no queries were affected whatsoever.
00:52:50
Speaker
No queries were harmed in the making of this optimization. Exactly, right. Because cargo-semver-checks has hundreds of queries. Many of those queries are contributed by community members. They're not experts in Trustfall. They're not experts in semantic versioning necessarily. They're not experts in static analysis. So we want them to be able to write queries without really worrying too much about how all of these pieces fit together under the hood. Trustfall has an optimization interface that I designed as part of this work.
00:53:19
Speaker
It essentially allows the adapter, those four functions that we talked about, to ask questions about what the query is intending to do with the data that it's asking for. Okay. Because this comes back to not knowing where the data is coming from. If you know that it's on disk and you know which indexes exist, then you just use the index. Easy. But Trustfall has no idea what indexes exist, or even what an index is. So it requests some data, and it's up to the adapter implementation to use the interface to speed up queries as much as possible.
00:53:58
Speaker
So going to the query that we were talking about earlier, about finding functions that are no longer importable, you could imagine running it in a naive way, where for every function and every path at which it could be imported, you look at every other function in the new package, and every path that it can be imported at, and just compare every pair, which is going to be n squared, and it's going to be horrifically expensive for a large package, obviously. And it's not the only query like this. Most queries in semver checking find the equivalent item, A to B, across the version gap. This is going to be a very common, very expensive pattern. You can imagine it would be a lot cheaper if, when loading the functions and their importable paths, we somehow knew that we were looking for a specific path and not for all paths.
00:54:51
Speaker
Right, so the query writer had in mind some path that they're looking at, and if the engine could fetch that one out of the bulk as opposed to iterating through everything, then life would be great. The challenge here is how do you make that happen without tying everything in a big knot?
00:55:12
Speaker
So like I mentioned, Trustfall has an internal representation for queries. In principle, you could say, hey, adapter, here's the entire query internal representation. You deal with it. Figure out if there's an opportunity for optimization. Yeah. That is not going to be a very satisfying solution, though. No, it would make you wonder what the query engine is actually doing for you. Exactly. Like, what is your purpose, right?
00:55:36
Speaker
And it's also unsatisfying because now we've said, OK, the person who's writing the adapter also better be an expert in query engines. But the person who's writing an adapter is probably more of an expert in, say, static analysis, and figuring out what paths exist and whether a trait is sealed and all of that kind of complexity. So I don't necessarily want to know about predicate pushdown and indexing optimizations and whatnot. So it would be much nicer if we could instead say, as part of the adapter interface,
00:56:06
Speaker
Trustfall allows the adapter, while it's resolving one of these four functions, to ask questions about what is going on with the query. Like, for example, Trustfall says, hey, please fetch all the public functions and their importable paths. And the adapter says, well, hang on. You're asking for functions.
00:56:26
Speaker
You wanted their importable paths, right? Okay. Did you want a specific path in there? Like, did you have a concrete one in mind, or should I just go ahead and get all of them? And Trustfall says, well, yes. Actually, I wanted this specific one, you know, foo::bar. That's what I'm looking for. Yeah. And the adapter says, oh, great. Found it. Here it is. That's it.
00:56:49
Speaker
Is this how, if we were querying a REST API, you'd deal with the difference between, sometimes I want to get a single record, and sometimes it's faster to get a whole page? Yes, exactly. Okay. So what I've just described here is called predicate pushdown. We noticed that we're doing a bunch of joins and then filtering on the end result. But if we could apply the filter first and then resolve the joins backwards, that saves us a bunch of work.
00:57:19
Speaker
But notice that this kind of asking-questions API never mentions anything about indexes or predicate pushdown or anything like that. It just says, hey, did you have a path in mind? And if so, I can grab that one directly from some hash table, as opposed to going through for loops inside for loops and all the complexity of doing that, right? Yeah. And more importantly, it's not at all tied to the intermediate representation that Trustfall uses for queries.
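The "did you have a path in mind?" exchange might be sketched like this. It is not Trustfall's real adapter interface: `FunctionAdapter` and its `required_path` parameter are hypothetical stand-ins for the idea that the adapter can learn about a filter and answer from a hash table instead of scanning.

```python
from typing import Dict, Iterator, List, Optional, Tuple


class FunctionAdapter:
    """Hypothetical adapter over (function name, importable path) pairs."""

    def __init__(self, functions: List[Tuple[str, str]]) -> None:
        self.functions = functions
        # A hash table keyed by path; the query author never hears about it.
        self.by_path: Dict[str, List[str]] = {}
        for name, path in functions:
            self.by_path.setdefault(path, []).append(name)
        self.rows_examined = 0

    def resolve_importable_paths(
        self, required_path: Optional[str] = None
    ) -> Iterator[Tuple[str, str]]:
        if required_path is not None:
            # The engine had a concrete path in mind: look it up directly.
            for name in self.by_path.get(required_path, []):
                self.rows_examined += 1
                yield (name, required_path)
            return
        # No hint available: fall back to producing everything.
        for name, path in self.functions:
            self.rows_examined += 1
            yield (name, path)


functions = [(f"fn{i}", f"mycrate::mod{i}::fn{i}") for i in range(1000)]
target = "mycrate::mod42::fn42"

# Naive: fetch everything, filter afterwards.
naive = FunctionAdapter(functions)
naive_hits = [row for row in naive.resolve_importable_paths() if row[1] == target]

# Hinted: the filter is pushed down into the adapter.
hinted = FunctionAdapter(functions)
hinted_hits = list(hinted.resolve_importable_paths(required_path=target))
```

Both runs return the same row, but the naive one examines every function while the hinted one examines a single entry, and the query text is identical in both cases.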
00:57:45
Speaker
So I could change how Trustfall queries are represented under the hood if it makes extra capabilities possible in future versions of Trustfall, and it will not break your implementations of adapters for Rust APIs or for GitHub or for file systems or for anything else. There will just now be more information they could have queried back for.
00:58:06
Speaker
Exactly. OK. So queries that can query the query engine. Exactly. Think I've got it. It's all queries all the way down. You have one tool and you're using it thoroughly. OK, so we have to talk about caching then, because this is another big part of optimization. Absolutely. If I run a query against GitHub and then run the same query again because I'm debugging it and I change it, am I going to do all that work again? Depends on your adapter implementation.
00:58:35
Speaker
Oh, okay, so you're pushing it to the adapter implementer. Right, because Trustfall has no idea where the data is coming from, right? For all it knows, it's in some in-memory data source where caching doesn't make any sense. Or maybe it's somewhere very expensive like GitHub and some caching might make sense, right? So right now, it's the adapter's responsibility. With more time and hopefully more funding,
00:58:58
Speaker
you could very much imagine having some middleware where you write some naive adapter and then you wrap it in some middleware that applies caching, because at an adapter level, you know GitHub is expensive, so wrap it in the caching middleware, and my in-memory data source is not expensive, so don't wrap it in the caching middleware.
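The caching-middleware idea, which is described above as something imagined rather than shipped, could be sketched as a wrapper around a naive adapter. The `fetch` method and both class names are hypothetical and much simpler than a real adapter interface.

```python
from typing import Dict, List


class ExpensiveAdapter:
    """Hypothetical naive adapter for a slow source; counts how often it is hit."""

    def __init__(self) -> None:
        self.calls = 0

    def fetch(self, entry_point: str) -> List[str]:
        self.calls += 1
        return [f"{entry_point}-{i}" for i in range(3)]


class CachingAdapter:
    """Middleware: same interface as the wrapped adapter, plus a cache."""

    def __init__(self, inner: ExpensiveAdapter) -> None:
        self.inner = inner
        self._cache: Dict[str, List[str]] = {}

    def fetch(self, entry_point: str) -> List[str]:
        if entry_point not in self._cache:
            self._cache[entry_point] = list(self.inner.fetch(entry_point))
        # Return a copy so callers cannot mutate the cached rows.
        return list(self._cache[entry_point])


slow_source = ExpensiveAdapter()
cached = CachingAdapter(slow_source)

first_run = cached.fetch("repos")
second_run = cached.fetch("repos")  # served from the cache
```

An in-memory source would simply be used without the wrapper, which is the per-adapter choice described above.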
00:59:15
Speaker
Yeah, OK. Yeah, I can see that. It makes me wonder if at some point in the far future, we'll be using something like DuckDB or SQLite as your caching
Adapting to data format changes
00:59:27
Speaker
layer. And then it really is queryception. Right. The thought has occurred to me. Ultimately, I think the key observation with Trustfall is this. Users care about running some query. They have some question they want answered over some set of data sources.
00:59:43
Speaker
Whether it's DuckDB, SQLite, a REST API, a bunch of JSON files, something that's warm in cache, or an intern with pen and pencil answering the query, doesn't matter. So Trustfall is an attempt to abstract away how we are getting that data to you, including where it is coming from, what all the formats are, what all the moving pieces are, all the optimizations, and just get you the data that you requested. And it is my belief that that makes everyone happier.
01:00:12
Speaker
Because if I'm a user writing queries, I want to not shut the door on future improvements, because maybe GitHub uploads a data dump of all of the stuff that I might want, available as, I don't know, a BigQuery instance or something like that, which I can query and which will do the job just as well. I don't want to be stuck querying the REST API one at a time if I can query something more intelligent and more bulk. As the query engine gets smarter, I want my queries to get faster, and I don't necessarily want to do a lot of work about it.
01:00:44
Speaker
Yeah, yeah. I've seen a few different ways people are trying to mix in data streams with LLMs, and I can see joining to an LLM as a data source being a potentially very interesting use case to people. Absolutely. Another interesting use case to me is sort of the flip side of all of this. When people were initially attempting to build semver-checking tools for Rust,
01:01:10
Speaker
there was a decent bit of friction about, oh, yes, this API is unstable, but if you change it in this way, it will break these tools, and now we will need to do a whole bunch of work just to keep the lights on, just to keep the tool working as well today as it was yesterday. That's frustrating, and it has a chilling effect on the maintenance of those tools downstream, because nobody wants to build stuff knowing that they're breaking somebody downstream of them, and opening up a bunch of extra work for them.
01:01:38
Speaker
With cargo-semver-checks being built on top of Trustfall, I actively encourage the folks that are building rustdoc and rustc, the compiler, to make changes, breaking changes, to the format with which I'm getting the data, if it means that the format is getting better and we're getting more data that we can lint and find breaking changes in. Because it's very cheap for me to keep the tools working. They make a new format,
01:02:05
Speaker
I make a new version of the adapter, I run cargo check, it finds all the places that are now broken, I tweak the two things that they've changed, and now I have a new version that works for everything. It's five minutes' worth of work, and I have 95 percent of that automated anyway. Yeah, and you never have to change the queries. Exactly. I never have to change the queries. And as a result of this, the folks working on this, I think, are a lot happier because they know they don't have to worry about it. If they have an idea for an improvement, they don't have to sit on that idea for a long time and try to collect a bunch of those ideas and only then make drastic changes all at the same time, bundled together, and then break everyone all at once, and then everyone catches up, and then everything is frozen in amber for another year until the next opportunity to break everything. No, break everything every other week. I don't care.
01:02:57
Speaker
If you keep the cost of change cheap, then change actually happens. Exactly. So it's this decoupling that I think makes everyone happier. We can make things faster, we can make things better, and we can add extra queries, and all of these things can happen separately from everything else.
01:03:13
Speaker
Okay, so from some of the stuff we've said, this seems to lean more to the kind of programmer tooling space. But to pull back out, are people using Trustfall for more business-y queries? At my last job, we definitely were. So Trustfall happened sort of by accident. We were building a knowledge graph database. Graph databases circa 2015 were an extremely immature product.
01:03:44
Speaker
you know, all sorts of performance bugs, correctness bugs, all sorts of things like that. And we wanted to have a tremendous amount of flexibility in our product because it was key to our value proposition.
Trustfall in business: Abstracting data and creating endpoints
01:03:55
Speaker
So long story short, I ended up being everyone's "can you please write me a query" as a service.
01:04:03
Speaker
That is not necessarily a role that I enjoyed. I didn't think it was using the best of my abilities, because I had just learned a dozen or so rules about how to not blow my own foot off with database queries. So I wrote a tool that can translate from a GraphQL-ish syntax into database queries that are safe, that dodge all of the issues that we knew about in the databases. Right. Yeah. Pretty soon,
01:04:30
Speaker
the graph database that we were building was not the only database that we cared about. We had SQL databases that had some other data. We had some stuff in S3 that was video files, audio files, things like that, which obviously it's not the best idea to just shove into a SQL database if it's a gigantic binary blob. No, absolutely not.
01:04:52
Speaker
But we had all the same needs. And in fact, we had more needs. We had machine learning teams that wanted to do interesting machine learning, but they weren't experts in the 30 different database technologies that we were using under the hood. We had people building all sorts of products on top of this that had needs like, fetch this piece of data from over there and that piece of data from over there. And I don't know if you've ever worked with financial data, but every company has 16 different names depending on what exactly you care about. It has the legal name, and what everyone calls it, and what the security is called, and the international-safe name versus the local name if it's, like, a Chinese company or something. Yeah. So someone at a product level wants to make a decision that when we say company name, we mean this one. So we needed some tool to massage all of these pieces of data into something that is
01:05:43
Speaker
more ergonomic and less dangerous than just, here's 30 different databases, have at it, have fun, right? Yeah. So Trustfall sort of came out of that world where we could just say, hey, we have seven different definitions of what a company name is. For the specific use case for this product, we care about only this one and not the other six. We will delete the other six from the Trustfall schema. They're physically not accessible. So if you want the company name,
01:06:13
Speaker
there's a company vertex, there's a name property, and that's it. Under the hood, it's not called name, it's called something else, and it's in some database, and there's a bunch of complexity, and there might be three joins involved in getting it. Yeah. Not your problem. From your perspective, it's a property on a vertex, you just get it, and you're done. Yeah, yeah, yeah, abstracting away that data complexity problem. Yeah, I can see that, I really can.
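The idea of exposing a single `name` property while hiding the other definitions can be sketched in plain Rust. This is only an illustration under assumptions: the `RawCompany` and `Company` types and their field names are hypothetical, and nothing here uses Trustfall's actual API.

```rust
/// Hypothetical raw record, roughly as the underlying databases might hold it.
/// The field names are invented for this sketch.
#[derive(Debug, Clone)]
pub struct RawCompany {
    pub legal_name: String,
    pub common_name: String,
    pub security_name: String,
}

/// The vertex as the schema exposes it: callers only ever see `name`.
pub struct Company {
    raw: RawCompany,
}

impl Company {
    pub fn new(raw: RawCompany) -> Self {
        Self { raw }
    }

    /// The product-level decision ("company name means this one") lives in
    /// exactly one place. Here we assume the product chose the common name.
    pub fn name(&self) -> &str {
        &self.raw.common_name
    }
}
```

Because the other names have no accessor, a query written against this shape cannot pick the wrong one by accident.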
01:06:38
Speaker
And so this led to a very nice feedback loop: we could send business development folks, essentially engineers who are also on the business side, to customer sites, and we would have conversations like,
01:06:54
Speaker
Hey, what else do you want from the dataset? Can we provide it? How do we make it work? Right. They get all of this information from our customers. They come back to the office in the afternoon. They sit down. They write some Trustfall queries. They check those Trustfall queries into the specific magical place in the repository that we had, such that every file in that place in the repository becomes a new endpoint at a specified path.
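The "every file in that directory becomes an endpoint" convention might look something like the following. This is a guess at the general shape, not the system described in the conversation: the `endpoint_for` function, the queries root, and the `.graphql` extension are all assumptions made for the sketch.

```rust
use std::path::Path;

/// Map a checked-in query file to the endpoint path it would be served at.
/// Returns `None` for files outside the queries directory or with another extension.
pub fn endpoint_for(queries_root: &Path, query_file: &Path) -> Option<String> {
    // Only files under the designated directory become endpoints.
    let relative = query_file.strip_prefix(queries_root).ok()?;
    // Assumed convention: query files end in `.graphql`.
    if relative.extension()?.to_str()? != "graphql" {
        return None;
    }
    // Drop the extension and join the remaining path segments with `/`.
    let without_ext = relative.with_extension("");
    let segments = without_ext
        .iter()
        .map(|segment| segment.to_str())
        .collect::<Option<Vec<&str>>>()?;
    Some(format!("/{}", segments.join("/")))
}
```

Under these assumptions, checking in `queries/companies/by_name.graphql` would yield an endpoint at `/companies/by_name`, with no other code changes.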
01:07:20
Speaker
And then they write up an email, queue it up to go to the customer the next morning, and they say, hey, thank you so much for taking the time to meet yesterday. By the way, all of the things that you requested, we have shipped new endpoints, including exactly the data that you requested. Here it is. They already work with your API credentials. You're good.
01:07:37
Speaker
And the fact that they didn't have to worry about how to glue 30 different databases together and how to not blow up anything along the way, and that all of the query execution magic (caching, predicate pushdown, batching) would just happen for them, was absolutely lovely. They liked it, and we on the infrastructure team also liked it, because we didn't have to worry about, oh, why is somebody fetching a billion rows? Because the answer a lot of the time is that they didn't know better, or they didn't really think about it, or they didn't realize that the queries are that big. In our case, the computer realized the query was going to be expensive, and so it did something intelligent to not blow up the cluster. Yeah. So everyone was really happier in that case, and our company was able to move very quickly.
01:08:20
Speaker
Okay, so the next time I'm in a situation where I'm thinking about importing the data I need into a database just so I can query it, I would definitely have Trustfall on my list. I can promise you that. How do I get started querying the weird data source that you've never encountered? I need to write these four functions, but what's my actual job? I sit down and start a new cargo project?
01:08:49
Speaker
Yes, if you want to write it in Rust, you can start a new cargo project. You would add trustfall as a dependency, and then you would step away from the computer and ideally go to a whiteboard. Okay. The most important thing to figure out is what queries you want to run and what the best way is to represent the data to make those queries easy.
01:09:15
Speaker
Right, okay. Most people, when they think, I want to query a dataset, want to make a one-to-one correspondence between their JSON file or their REST API and the Trustfall schema, and then they go, wow, writing Trustfall queries is really hard.
01:09:33
Speaker
But most of the time, your queries are looking for some specific patterns. And the data representation that you're getting from whatever your data source is, is not necessarily ideally suited to those kinds of queries. So the best thing that you can do is think about what are the queries that you want, and how you can make the most ergonomic possible schema for yourself, because that's the thing that you're going to be interacting with on a regular basis.
01:09:59
Speaker
Right, so you're saying don't drive from the data source up, drive from the API you want down. Exactly, right. Start with the wish list and try to get somewhere that the data source can provide, as opposed to vice versa. Okay. At the end of this process, you have a hypothetical schema that you haven't implemented yet, and you have some queries that are valid over that schema that fulfill your use case. It's now time to dive back into code.
01:10:28
Speaker
There is a tool in the Trustfall repository called trustfall_stubgen. It can take a schema and it will generate all of the boilerplate that you need to write a functional adapter. So it will produce implementations for those four functions, and it will give you a bunch of functions with to-do placeholders inside them for resolving all of the edges, for example, for resolving all the entry points, and so on.
01:10:53
Speaker
It will also add a unit test that does not pass until you've successfully implemented everything, until there are no to-dos left in the code base. Right. And then you essentially just play to-do whack-a-mole with your data source. Right, yeah, yeah. Yeah, OK, which hopefully I've already got some Rust library to help me deal with. Yeah, that makes sense. And so most often what is the case is you have a Rust library that has third-party definitions for your data source, for example.
01:11:19
Speaker
Right. And you just need to say, okay, I have a vertex enum for Trustfall purposes. It has variants for the different types of vertices it could be. So in the GitHub example, it could be a repository, it could be a commit, it could be a release, it could be a user, and so on. You have a variant for each of these on some enum vertex type that you've defined for Trustfall adapter purposes. Yep. Right. So that's the vertex that gets passed around.
01:11:48
Speaker
And then you say, okay, I want to resolve the name of this user profile. Okay. That exists. I know that the vertex must be a user variant, or else something has gone horribly wrong in the query. And the user variant has inside it the third-party definition type for a user. Right. And that has some field that gives me the username. Okay, great. Get that field, return it to Trustfall.
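A minimal sketch of that vertex enum and property lookup, in plain Rust, might look like this. It is an illustration under assumptions: `ApiUser` and `ApiRepository` stand in for types that an existing GitHub client library would provide, and `resolve_user_name` is a hypothetical helper, not one of Trustfall's actual adapter functions.

```rust
/// Stand-ins for types an existing client library might define.
/// The names and fields are invented for this sketch.
#[derive(Debug, Clone)]
pub struct ApiUser {
    pub login: String,
}

#[derive(Debug, Clone)]
pub struct ApiRepository {
    pub full_name: String,
}

/// One variant per kind of vertex the schema exposes.
/// This is the value that gets passed around during query execution.
#[derive(Debug, Clone)]
pub enum Vertex {
    Repository(ApiRepository),
    User(ApiUser),
}

impl Vertex {
    /// Borrow the inner user if this vertex is the `User` variant.
    pub fn as_user(&self) -> Option<&ApiUser> {
        match self {
            Vertex::User(user) => Some(user),
            _ => None,
        }
    }
}

/// Resolve the `name` property of a user vertex.
/// A non-user vertex here means the query engine and adapter disagree
/// about types, so panicking is the honest response.
pub fn resolve_user_name(vertex: &Vertex) -> String {
    let user = vertex
        .as_user()
        .expect("`name` was requested on a vertex that is not a User");
    user.login.clone()
}
```

The adapter itself stays thin: the enum wraps whatever the client library already gives you, and each property resolver is a field access on the right variant.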
01:12:16
Speaker
That sounds straightforward. Yeah. Because, again, of limited time, I have not managed to write as much documentation about this process as I would have liked. So what I usually recommend to people is: work with me on the adapter that cargo-semver-checks uses.
01:12:37
Speaker
So I can help you write a query to get familiarized with the query language. And then I can point you to a place where the schema needs extending, where some piece of information that will be useful for semver checking is not yet implemented. And I can help you sort of learn by doing, where you design a new piece of schema for the new piece of data, and then you go and implement that in the adapter. And by that process, you will learn how all of the different helpers work. And I will happily code review your code and give you pointers in the right direction.
01:13:07
Speaker
Obviously, plenty of people have managed to do just fine without doing this, but if you're the kind of person that really likes to have a step-by-step guide, I'd recommend the mentorship route as opposed to just brute-forcing it against an opaque blob of code. And then write up your experiences afterwards so you've got that documentation.
01:13:30
Speaker
Exactly. That would be phenomenal. Because Trustfall is a query engine at the end of the day. I've done my best to make the APIs as simple and as easy as possible. But there's still a minimum level of complexity when you're dealing with queries and you have optimization concerns and everything needs to be lazily evaluated. So you're deep in iterators. And if you're not necessarily super familiar with those concepts, then it can come off as quite intimidating.
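To give a feel for what "lazily evaluated" and "deep in iterators" mean in practice, here is a small plain-Rust sketch. It assumes a hypothetical `Repo` vertex and a hypothetical `resolve_stars` function; the signature is not Trustfall's. The shape is the point: take an iterator of vertices and return an iterator of (vertex, property value) pairs without collecting anything up front.

```rust
/// Hypothetical vertex type for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct Repo {
    pub name: String,
    pub stars: u32,
}

/// Lazily pair each incoming vertex with one of its property values.
/// No work happens until the caller pulls items from the returned iterator,
/// so the input may be huge, or even endless.
pub fn resolve_stars<'a>(
    vertices: impl Iterator<Item = Repo> + 'a,
) -> impl Iterator<Item = (Repo, u32)> + 'a {
    vertices.map(|repo| {
        let stars = repo.stars;
        (repo, stars)
    })
}
```

Since nothing is computed until it is requested, a caller that only needs the first two results can write `resolve_stars(source).take(2)` and the rest of `source` is never touched.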
01:13:57
Speaker
At a conceptual level, it's less scary than it looks, but it does take a little bit of perseverance to just push past the unfamiliarity of being in an environment that you're not necessarily super familiar with. OK. I'm going to have to think of something exotic to go and query and come back and badger you. Please do. Awesome. But for now, thank you very much. I'm going to go into the data world and start thinking of queries differently. And maybe my database is a less center-of-the-world thing than I previously thought. Interesting. I'm happy to try to keep changing people's minds about everything being a database if you're bold enough to query it. So thank you so much for having me, and it was a pleasure to chat. Well, thank you for coming to answer my questions, because everything is a query. Predrag, cheers.
01:14:49
Speaker
Thank you, Predrag. If you're curious to write some queries, there is a link to the Trustfall source code in the show notes, as you would expect if you're a regular listener to this podcast. Everything's in the show notes. And if you're not a regular listener, why not hit subscribe and become one?
01:15:05
Speaker
Either way, if you've enjoyed this episode, please take a moment to like it, rate it, share it with a friend or a social network, and I'll catch you for another episode soon. I've been your host, Chris Jenkins. This has been Developer Voices with Predrag Grevsky. Thanks for listening.