
Semantic Search: A Deep Dive Into Vector Databases (with Zain Hasan)

Developer Voices

As interesting and useful as LLMs (Large Language Models) are proving, they have a severe limitation: they only know about the information they were trained on. If you train it on a snapshot of the internet from 2023, it’ll think it’s 2023 forever. So what do you do if you want to teach it some new information, but don’t want to burn a million AWS credits to get there?

In exploring that answer, we dive deep into the world of semantic search, augmented LLMs, and exactly how vector databases bridge that gap from the old dog to the new tricks. Along the way we’ll go from an easy trick to teach ChatGPT some new information by hand, all the way down to how vector databases store documents by their meaning, and how they efficiently search through those meanings to give custom, relevant answers to your questions.

--

Zain on Twitter: https://twitter.com/zainhasan6
Zain on LinkedIn: https://www.linkedin.com/in/zainhas
Kris on Twitter: https://twitter.com/krisajenkins
Kris on LinkedIn: https://www.linkedin.com/in/krisjenkins/
HNSW Paper: https://arxiv.org/abs/1603.09320
ImageBind - One Embedding Space To Bind Them All (pdf): https://openaccess.thecvf.com/content/CVPR2023/papers/Girdhar_ImageBind_One_Embedding_Space_To_Bind_Them_All_CVPR_2023_paper.pdf
Weaviate: https://weaviate.io/
Source: https://github.com/weaviate/weaviate
Examples: https://github.com/weaviate/weaviate-examples
Community Links: https://forum.weaviate.io/ and https://weaviate.io/slack

--

#vectordb #vectordatabase #semanticsearch #openai #chatgpt #weaviate #knn

Transcript

Introduction to Vector and Semantic Search

00:00:00
Speaker
Vector search and semantic search are today's topics on Developer Voices.

Integrating Language Models with Datasets

00:00:05
Speaker
How do you take a large language model, which is all the rage at the moment, and teach it about your data set? What does it even mean to take a large language model and get it to search through the meaning of your code base or your documentation or your product catalog or whatever data you're dealing with?

Updating Language Models with New Knowledge

00:00:24
Speaker
How do you teach a computer to understand it?
00:00:27
Speaker
It seems really hard because large language models seem to come pre-baked from the factory as these trained, fixed things. How do you teach it new stuff?
00:00:38
Speaker
This is one episode where we get to go delightfully deep on how this all actually works. What's a large language model really doing? What does it need an auxiliary database for? And if it does, what's the flow of data back and forth between them? What is that auxiliary database actually doing? What's it doing in the pipeline, but also how is it doing it? What are the data structures? How does it organize data in memory on disk?
00:01:07
Speaker
What does it mean to search through meanings? How much work is it? How do you make it fast? How do you make it cheap enough to be used by your users? I'm interested in all of this stuff, and I really wanted to find someone who could take me right into the guts of what's going on here. And

Expert Insights with Zain Hasan

00:01:24
Speaker
I think I lucked out with today's guest, Zain Hasan. He has both the depth of knowledge to go right down into the internals,
00:01:32
Speaker
and the clarity of explanation to really bring it to life and make it make sense to me. And it really does, and I want to share that with you. We go all the way from clues about how you can improve your prompt engineering, through index design, computational complexity and optimization, to where Zain thinks the future of semantic search is headed.
00:01:55
Speaker
I can promise you, just from the first five minutes of this interview, you'll have a much better understanding of how this is all put together. So let's get going.

Vector Databases and Large Language Models

00:02:03
Speaker
I'm your host, Kris Jenkins. This is Developer Voices, and today's voice is Zain Hasan.
00:02:22
Speaker
Vector databases is the topic today, and we've got Zain Hasan as our expert to talk about it. How you doing, Zain? Hey everybody, I'm doing well. I'm doing well. Good, good. You've just got back from filming a course for LinkedIn. So you're now a Hollywood star, right? Yeah, exactly. Makeup and everything. In fact, I did my makeup before this, so. You're looking fabulous. So we should probably jump out of the unfamiliar world of Hollywood and straight into the tech as fast as we possibly can.
00:02:52
Speaker
We want to talk, I wanted you to tell me about vector databases in as much depth as I can extract from you. But we should probably start on the common understanding of the state of ML, right? I think everybody has ChatGPT as a reference point, right? Even if they haven't looked into this much.
00:03:14
Speaker
And I have this impression of ChatGPT as a pre-baked neural net, and you can ask questions of it and it can answer because neural nets are magic, but it's pre-baked, so I can't add any data to it. I have a vague sense that that's where needing a database comes in. Take me to a proper understanding.
00:03:38
Speaker
Yeah, so if we zoom out a bit and we think about what ChatGPT is: ChatGPT is a chatbot that's built on top of a base model. And the base model itself is just a sentence completion tool. You can almost think of it as a fill-in-the-blanks tool. So it's studied, essentially, pretty much all of the data on the internet to learn about
00:04:03
Speaker
the co-occurrence of words, and which words are more likely to be used with other words. So if you say something like, I don't know, "the monkey ate ___,"
00:04:17
Speaker
"banana" would be a higher-probability completion compared to "cabbage" or "car"; those would be very low-probability completions. And so on top of that, you can build chatbots. So you can actually force a base model to act like a chatbot if you just prime it correctly. So you can say,
00:04:36
Speaker
"Speaker 1 said XYZ. Chatbot answered. Speaker 1 asked question XYZ. Chatbot answered." And then, for the question that you actually want answered, you say, "Speaker 1 asked the question: what's the color of the sky?" and then "Chatbot:" and you leave it empty, and then it'll autocomplete based on the highest-probability tokens it has been trained on. ChatGPT is just a finer, refined version of this, where
00:05:03
Speaker
we control the quality of its generation by fine-tuning it on higher-quality data points that have been written by contractors. So you give contractors a question, and then they generate high-quality answers, and you train the base model to output those higher-quality answers.
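To make that priming trick concrete, here is a minimal sketch of what such a prompt might look like. The dialogue format itself is the "priming"; the wording is illustrative, not a prompt from the episode.

```python
# A hedged sketch: priming a raw completion model to behave like a chatbot.
# The model simply continues the text; the dialogue format is the priming.
prompt = (
    "Speaker 1: What is the capital of France?\n"
    "Chatbot: The capital of France is Paris.\n"
    "Speaker 1: What color is the sky?\n"
    "Chatbot:"  # left empty -- the model autocompletes the answer
)
```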
00:05:24
Speaker
All this to say, even if you take ChatGPT and all of its fine-tuned versions, the issue is that it doesn't know what it doesn't know.

Databases in Retrieval Augmented Generation

00:05:33
Speaker
It only knows what it's been trained on. So it knows the probability of words or it knows the concepts that it's been trained on. Where a database comes into this is
00:05:45
Speaker
If you want to provide external knowledge to the system, whether that's because the data wasn't available at training time, so it's future data that's occurred after the training process, or more realistically, it could be data that's private. So if I'm a company, I have proprietary data.
00:06:07
Speaker
And I want to use the reasoning power or the generative power of a large language model on this data. I'm not going to send this data over to a third party company that's going to train the large language model. That's my private data, but I still want to be able to reason over this data.
00:06:25
Speaker
And that's where the database comes in. You can run the database locally. That database can have your private data. And then you attach that database to the large language model as almost like a secondary
00:06:39
Speaker
bank of information or it represents state for the large language model, really. And it can then retrieve information in real time before it generates. So it gives it the ability to read context, ground its answer in that context, and then generate as a result of that. So that's the high level picture. A lot of people call this retrieval augmented generation because you're retrieving context and then you're augmenting the generation with that retrieved context.
00:07:09
Speaker
And you can use any database for this. A vector database naturally fits in, because as we'll talk about later, you can query a vector database using natural language. You can talk to a vector database. So it's easier to get a large language model to query the database and then retrieve context from it. So

Operation of Vector Databases

00:07:30
Speaker
that's the larger picture of how the database fits in. It's essentially just a knowledge store.
00:07:35
Speaker
OK, take me one level deeper on that. When I type in my query, what's the flow of data through the large language model and through the vector database? And how is it recombined to do something? Yeah. So retrieval augmented generation, you can implement it in multiple ways, and there's different complexities of it. But the simplest way that you can implement retrieval augmented generation is: think of going to ChatGPT and asking a question.
00:08:03
Speaker
you can take that question and you can turn it into a query for the vector database. So let's say your question is,
00:08:12
Speaker
what type of condiments go well with a hamburger, right? So you can take that question, and rather than send it off to the large language model immediately, you turn it into a query for a vector database. That query can itself be the question. You send it to the vector database, and now you ask the vector database, retrieve for me
00:08:34
Speaker
the five most relevant documents that I have in my knowledge store that are to do with this question. So in this question, I've got condiments, I've got a burger, I've got ham. So it's going to retrieve for me things that are related to those concepts.
00:08:50
Speaker
And then those five things that come back can then be stuffed into my prompt. And we can say, "Here's the question, and here's useful information that you might find relevant." And that's what you send off to the large language model to generate with. OK. So I could literally do this manually: open one window with ChatGPT, go and query for five web documents related to my search term, and then mash all of those into it and prompt, now answer my question.
00:09:16
Speaker
Yeah, exactly. In fact, before Retrieval Augmented Generation was popular, I was at a meetup here, and this was back in March of 2023, so ChatGPT was taking off and people were hearing about it.
00:09:31
Speaker
Back then, what a lot of people were doing was they would search over their PDF and they would say, okay, this chunk of text is what's relevant to this question. Type out the question, copy-paste the huge chunk of text, dump it into your chatbot, and then say, answer the question, here's relevant information you might need to know. That's basically what's happening, but the problem there is scalability. What if you have a billion documents or hundreds of millions of documents? And then
00:10:00
Speaker
You can't really do that, but that's where the vector database comes in. I see. I feel like you've just given away a magician's secret.
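As a rough sketch of that manual flow, the prompt-stuffing step might look like this in Python. Here `retrieve` is a hypothetical placeholder for whatever search returns the top-k documents; it is not a function from any library discussed in the episode.

```python
# A minimal sketch of the retrieval-augmented flow described above.
# `retrieve` is a hypothetical helper standing in for the vector-database
# query; it should return the k most relevant document strings.
def build_rag_prompt(question: str, retrieve, k: int = 5) -> str:
    context = retrieve(question, k)  # top-k documents from the knowledge store
    context_block = "\n".join(f"- {doc}" for doc in context)
    return (
        f"Here is some information you might find relevant:\n{context_block}\n\n"
        f"Using that context, answer the question: {question}"
    )
```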
00:10:14
Speaker
OK, so then I get how that works. We'll keep the LLM as our magic box of neural networks. Let's take a look at the vector database side. When I type in "which condiments go well with a hamburger", what exactly is happening to that sentence? Yeah, so this is a little bit of a look behind the scenes of why it's called a vector database. So anytime you query a vector database,
00:10:43
Speaker
a vector database understands vectors. So it understands groups of numbers. Now, whether those are groups of three numbers or a thousand numbers, that's determined by an underlying machine learning model. But the whole idea here is that if I type in "what condiments go well with a hamburger", that is a human-understandable version of a sentence, right? So I can ask people that and I can get coherent responses back. But if I ask that to a computer,
00:11:12
Speaker
It has no idea what these words mean. The only way that it understands meaning is if I capture this question or this sentence in numbers.
00:11:26
Speaker
or groups of numbers. So we want to give every word or every token here an ID. And then we want to analyze which words here from my training set commonly occur with other words and
00:11:43
Speaker
I need to be able to transfer the concepts in this human understandable version of a question to a computer or machine understandable version of a question. So essentially I need to go from this sentence to now a vector representation of that sentence.
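A minimal sketch of that step, assuming an off-the-shelf embedding library such as sentence-transformers. The model name here is an illustrative assumption, not one mentioned in the episode.

```python
# Sketch: turning sentences into vectors with an off-the-shelf embedding model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional output
vecs = model.encode([
    "What condiments go well with a hamburger?",
    "Ketchup and mustard are classic burger toppings.",
])
print(vecs.shape)  # (2, 384) -- one float vector per sentence
```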
00:11:58
Speaker
So, is my intuition right here that you're going to end up with a vector that, very broadly, it's a vector of floating point numbers, and the first one is the probability that we're talking about cats, and the second is the probability that we're talking about fast food, and the third is the probability that we're talking about Germany.
00:12:19
Speaker
So the first part I think is correct, but the second part is not. When you capture the sentence as vectors, it's not clear what the individual dimensions are. It could be that the first one is the probability we're talking about cats, but we don't really know what the latent space represents. We don't know what each dimension actually means.
00:12:45
Speaker
So this is the whole point of optimizing a neural network. You initialize these weights randomly and then you say, I want you to correctly predict the next word and optimize all of these millions or billions of weights appropriately so that I can
00:13:03
Speaker
predict the next word better. And you get better and better as you optimize them. So we're not really sure whether the first dimension is, are we talking about cats? Second dimension, are we talking about Germany? It's whatever optimized the answer. It's whatever got us the lowest loss. Right. So the vectors are still sort of a black box, but I would expect two paragraphs that talk about burgers to have some similar subsets of vectors. Is it something like that?
00:13:33
Speaker
Exactly. I almost think of vectors as a barcode. Let's say you have a 100-dimensional vector. There's just 100 floating points in a NumPy array, let's say. And the higher the value,
00:13:48
Speaker
let's say it's more black, and the lower the value, let's say it's white. So now you can almost think of a 100-dimensional barcode, where you have a colored strip for every number. The higher the value, the darker the color; the lower the value, the whiter the color. So now you essentially have a barcode.
00:14:10
Speaker
So the barcode for a sentence that's talking about a burger is going to light up in different places, whereas a barcode for a sentence that's talking about a cat is going to light up in different places. And you can actually calculate these barcodes and say, OK, this is a sentence about a cat, this is a sentence about a dog, and this is a sentence about a burger. The cat and the dog barcodes would be a lot more similar than the burger and the cat or dog barcodes. Right.
00:14:39
Speaker
Yeah, and if you've got a barcode with all three of those lit up, you're in a really dodgy restaurant. Exactly. So that's the whole idea behind Vector Search. You're comparing these barcodes and you're saying, well, how close is this barcode to this other barcode? And the idea behind the barcode is that it's just some semantic, it's a capture of the semantics behind the human understandable version of the data.
00:15:03
Speaker
Okay, so I still need some kind of model that's going to turn my data, whether it's text or a PDF or an image, into a vector.
00:15:15
Speaker
Exactly. That's the key point here. The reason why they're called AI native or machine learning databases or AI first databases, there's a lot of buzzwords. But the reason why vector databases are affiliated with machine learning and they're kind of interwoven with machine learning is because they search over
00:15:36
Speaker
these vectors that are spit out and generated by ML models. And most of the time, they're neural networks. So that's why semantic search is also known as neural search sometimes.
00:15:46
Speaker
Right, okay. I understand up to that point. So what we're really talking about underneath all this, once you've finished neural-networking things, is a database that's good at storing vectors of floats and then searching for similarity between vectors of floats. Yeah. So how does that work?
00:16:13
Speaker
So on a high level, intuitively understanding it, it's what we talked about. If I give you
00:16:20
Speaker
a barcode for a question, your job is to say which barcode... let's say every single object you have, every file you have on your computer, can be captured into a barcode. So that includes text documents, but interestingly also images, audio files, video files. And we can
00:16:45
Speaker
talk about multimodality later, but let's just say we only have text documents for now. And

Challenges and Solutions in Vector Searches

00:16:49
Speaker
we have the ability to turn every text document into a vector or a barcode. Vector search or semantic search or neural search is effectively saying, if I have this question and I have the barcode for this question, what are the five most similar barcodes or vectors to this question vector or question barcode? Intuitively, that's what's happening.
00:17:13
Speaker
One level deeper, what's happening is, because you have quantified everything as a vector, you can actually take the distance between these vectors using multiple different metrics. The easiest one, let's say, is Euclidean distance, where you can actually measure the shortest distance between this vector and this vector.
00:17:34
Speaker
If you had three dimensional vectors or two dimensional vectors, you could actually plot them out on a grid and you could measure the shortest, the direct line from one vector to the other vector. Okay. I remember enough Pythagoras to calculate the distance between X, Y points. And I guess you just scale it up, right?
00:17:50
Speaker
Exactly. So for multidimensional vectors, you can just take every dimension, subtract the other corresponding dimension, and then you can put it into the Euclidean distance formula, and you get a measure of how different two vectors are. And that's just one distance metric. There's lots of other distance metrics that we can choose from, but that's the main idea. So now we can take this concept of a barcode, and we can quantify how different one barcode or one vector is from another vector.
00:18:18
Speaker
And essentially, what you're doing is quantifying how different the question vector is from every other vector that you have in your database. And then you're saying, now that I have a distance between this question and every other vector, I'm going to organize them from smallest distance to highest distance. And I'm going to cut off at the top K, let's say the top five. And these five objects are the ones that are the closest or the most semantically similar to what my user is interested in. And then you return those.
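That brute-force search is simple enough to sketch in a few lines of NumPy, which also makes the O(D·N) cost visible: one Euclidean distance per stored vector, then a sort.

```python
import numpy as np

# Brute-force k-nearest-neighbour search as described: one Euclidean
# distance per stored vector, then sort and keep the top k.
def knn(query: np.ndarray, vectors: np.ndarray, k: int = 5) -> np.ndarray:
    dists = np.linalg.norm(vectors - query, axis=1)  # distance to every vector
    return np.argsort(dists)[:k]                     # indices of the k closest

vectors = np.random.rand(10_000, 1_000)  # N = 10,000 docs, D = 1,000 dims
query = np.random.rand(1_000)
print(knn(query, vectors))
```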
00:18:46
Speaker
Okay. So I can almost literally give you the ballpark. Exactly. Like, I put your 10,000 documents into an imaginary space and I can say which other documents are near this one in space. Exactly, exactly. And actually your query is kind of the 10,001st document, isn't it?
00:19:05
Speaker
Yeah, so the query goes through the exact same pipeline because the query needs to be translated using the exact same machine learning model into vector space. And that vector space has to be the same vector space. Otherwise, it's like all of your documents are stored in German and then you're asking a question in English. And if the system doesn't understand both languages, you won't be able to extract relevant vectors.
00:19:30
Speaker
the vector language or the embedding space that we're talking about has to be generated by the same model. Okay, so I'm going to push you a bit deeper on that. So I can imagine if I actually want to use this in anger, maybe at work. I guess we're talking between 10,000 and 100,000 documents, that kind of order of magnitude, as soon as we get anything really interesting.
00:19:55
Speaker
I know enough from game programming that calculating collision detection on 100,000 objects is horribly slow. So how are you going to make this efficient calculating the distance between all these documents?
00:20:08
Speaker
Yeah, so that's the question that gets to the bread and butter of every vector database. If you think about vector search as I have some question vector and I have 10,000 object vectors or data points that I'm interested in and I want to retrieve the five most similar data points, you can't actually perform this k-nearest neighbors, this brute force k-nearest neighbors algorithm.
00:20:37
Speaker
Because, well, at a scale of 10,000 objects, you probably could. At 100,000 objects, now you're slowing down. At a million objects, you're slowing down even further. And the reason why is because let's say you have a question vector. And in order to find out which five objects it is closest to, you have to calculate the distance between this question vector and all 10,000 objects. And then you have to sort them from lowest distance to highest distance.
00:21:05
Speaker
So that is a complexity, a runtime complexity of N, where N is the number of objects that you've stored in here. So that's a big O of N, let's say. So as that N goes from 10,000 to 100,000,
00:21:20
Speaker
you get a 10x slowdown; as that goes from 100,000 to a million, you get another 10x slowdown. The other problem here is that, on top of that, the Euclidean distance calculation is also a function of how many dimensions the vectors have.
00:21:38
Speaker
You also have that component in the runtime. So if you have a vector of 10 dimensions, then you have a runtime of 10 times the number of objects. If now your vector is more interesting, if it's capturing a lot more
00:21:53
Speaker
a lot more features, if it's capturing more concepts, now it might be a thousand-dimensional. So now your runtime scales up by the dimensionality of each vector times the total number of vectors. So it's an M times N, right?
00:22:10
Speaker
Yes, exactly. So you get this kind of explosion where the more interesting data types you want to search over, you'll need bigger vectors. And the more of that type of data you have, you'll have more vectors in total. So now you're really slowing down. So what you need to do instead of brute force... Just to make that concrete before you go into how to optimize that, what sort of numbers are we typically talking about? What's an ordinary number of documents and an ordinary number of vector components?
00:22:38
Speaker
Yeah, so if you look at the dimensionality of vectors, commonly we get anything from 1,000 dimensional to 2,000 dimensional. Some of them are about 700 dimensional. But in that 1,000 dimensional ballpark, I would say is average. On the higher end, there's also models that generate 4,000 dimensional vectors. And a little note on this, I guess, as we have models that are
00:23:05
Speaker
multimodal, and that can capture all sorts of concepts, I imagine we'll have larger and larger dimensionalities. As a model needs to understand different concepts, not just text documents but also videos or images, the dimensionality will only grow. So we're starting off with 1,000 or 2,000, but 10,000 is not out of the question. Okay, in the near future.
00:23:28
Speaker
So that order of magnitude multiplied by what kind of document? The number of objects. So the number of objects usually, if we're talking about social media applications, if you're thinking about Facebook, Twitter, Instagram, you can have objects in the trillions easily.
00:23:47
Speaker
If you scale it down a little bit, let's say you're talking about recommender systems. Netflix, for example, has hundreds of millions of users. It has a catalog of, let's say, 20,000 unique TV shows, movies all around the world. So if you take that, you can easily get up to the billions or tens of billions of documents.
00:24:10
Speaker
Realistically, are a lot of people using this at that scale? So right now, a lot of people are trying proofs of concept. There are very few companies that are, at scale, moving into using vector databases.
00:24:22
Speaker
But we've tested with a billion documents officially. And then there's users that have tested it with even more. So I would say even three to four billion. And that's the upper end that we're talking about. And of course, that's changing day by day. But if I had to give you a ballpark, I would probably say 100 million to a billion. If you're on the lower end, probably 10, 20 million documents.
00:24:48
Speaker
OK, that's got me firmly convinced. Thousands multiplied by billions, that's firmly into the place where we need to optimize. So give me an idea: my fictional service has been very successful, and I've got 100 million users now. And I would like to say, which users buy stuff similar to Zain? Because I want to recommend stuff to Zain. How are you going to make that work fast?
00:25:18
Speaker
Yeah. So the trick here is not to do brute force K-nearest neighbor search. It's impossible to scale up when you have a runtime complexity of D dimensionality multiplied by N, where D is a thousand and N is a hundred million. There's no way you can scale that up.
00:25:38
Speaker
in real time. So imagine having an app: somebody clicks a product, and then you try to retrieve the 10 nearest objects to that; the person would be sitting there for days. So the idea is to do approximate nearest neighbors. And this is what all vector databases really do.
00:25:58
Speaker
And so the idea, at a high level, is that you want to give up accuracy for performance. So you're going to say: maybe I won't be able to retrieve the exact K nearest neighbors, sometimes I'll miss a few of the right neighbors, but I will gain a lot of performance by giving up a small amount of accuracy, a small amount of this recall of the right neighbors.
00:26:24
Speaker
So let's say you give up 5% or 10% of that recall. So 10% of the time, you might not get the correct nearest vectors; you might get an incorrect nearest vector. But you gain a lot for that 5% to 10% of recall. And of course, you can fine-tune that. You can say, well, I want 99% recall, and I'm willing to give up performance for that.
00:26:49
Speaker
But usually the trade-off is not direct. So you give up, let's say, 1% recall, but you gain a lot of performance. So then you can run in real time; you can run thousands of queries per second. And that's what approximate nearest neighbors does. It gives you the ability to increase performance by decreasing recall, decreasing accuracy. There are multiple algorithms that allow you to do this.
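One way to quantify that trade-off is recall@k: the fraction of the true k nearest neighbours that the approximate search actually returned. A minimal sketch:

```python
import numpy as np

# Measuring the recall trade-off described above: what fraction of the true
# k nearest neighbours did the approximate search actually return?
def recall_at_k(true_ids: np.ndarray, approx_ids: np.ndarray) -> float:
    return len(set(true_ids) & set(approx_ids)) / len(true_ids)

# e.g. the approximate index returned 4 of the true top-5 -> recall 0.8
print(recall_at_k(np.array([1, 2, 3, 4, 5]), np.array([1, 2, 3, 4, 9])))
```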
00:27:14
Speaker
But one is probably the most popular, one that supports all of the other functionality that databases also need. Are we saying, just on that accuracy point, is it like if you dropped me into a crowd and said, who are your 10 nearest neighbors? I probably wouldn't personally measure out exactly the 10 nearest, but I'd still get roughly the right ones. I mean, they wouldn't be wildly far away.
00:27:40
Speaker
Yeah. So essentially, what you're saying there is if you look in a vector space, if you think about it in vector space and you draw a bubble around your query vector, and you're saying, I want to draw a bubble that's large enough to encapsulate five nearest neighbors, in brute force search, you'll always get the nearest neighbors because you've kind of
00:28:02
Speaker
from beginning to end, calculated all the distances and sorted them. But here, you don't have the luxury of calculating all the distances. So you need to say, I better be smart about which distances I calculate. And then of the distances that I've calculated, I want to sort them out from lowest to highest.
00:28:19
Speaker
And because I haven't calculated all the distances out, because that would take too long, I might have missed the distances of some nearest neighbors. And so now I'm just overlooking that nearest neighbor, and I'm picking one that I think is the nearest. So yes, they'll be close together. Yes, they'll still be close, but they won't be the closest. That's the idea. But I'm not going to suddenly end up with someone on completely the other side of the room, for instance. No, no, no. So it's highly unlikely that you would,
00:28:47
Speaker
because you're still calculating distances, you're still organizing them to a degree. And

Hierarchical Navigable Small Worlds Model

00:28:52
Speaker
we'll talk about how the approximate nearest neighbors HNSW algorithm works in a second. But the main idea is that you get
00:29:01
Speaker
Let's say you have the nearest neighbor, but you fail to calculate that distance. Now, for all intents and purposes, this neighbor is invisible. So now you're going to say, what is the nearest neighbor that I did calculate? So it's not like you'll get all sorts of wonky distances. You're still going through the same process: sorting them based on distances in ascending order and picking the top five. But this top five might not have the correct five, just the five that you calculated the distances for.
00:29:27
Speaker
Right. So I might get five out of the top 10, for instance. Exactly. Yeah, exactly. OK. That convinces me to give up a little accuracy for a lot of performance. Yeah. So how does it work? So the performance that we're talking about here: you go from a runtime complexity of O of dimensionality times N to now a runtime complexity of logarithm of N. And that's a very, very scalable
00:29:55
Speaker
kind of an algorithm. And how this works is, essentially, intuitively, when you're searching for vectors that are close to your query vector, what you want to do is structure your search such that you make big jumps earlier on. So let's say you have your database of vectors. You come into a random object. And you say, how close is this object to my query?
00:30:25
Speaker
What you want to do is, initially when you come in, you don't want to have all of your vectors being searched over, because that would fall back down to brute force search. You want to structure it nicely. So one way that was proposed to do this is called the hierarchical navigable small worlds model, HNSW. And what this does is it takes your vectors and it makes a graph out of them,
00:30:53
Speaker
and it makes a hierarchy of them. And what it does is when you enter the search, you enter at a top level. And at the top level, there's only large distances that are available. So you take vectors that are very far from each other. So now you're almost like taking a highway from one vector to the other vector, and these vectors are quite far away. So you're saying of these vectors that are really far away,
00:31:20
Speaker
which one is closest to my query vector.
00:31:24
Speaker
So you're almost saying, I want to quickly localize which region or what type of thing my query vector is asking about. So rather than search for every other vector, I'm just going to say, out of these 500 vectors, and I might have, let's say, 10 million vectors, these 500 vectors are farther apart. They allow me to explore vector space efficiently. I want to find out which vector is closest to my query vector. And these are bigger jumps in vector space.
00:31:54
Speaker
I have a mental image that this is a bit like if you're looking for a house in a country, let's say in England, houses aren't uniformly distributed across England. They're clustered together in cities. So I start by indexing which is the nearest city. Exactly.
00:32:09
Speaker
So instead of saying, well, if you're interested in this house, let me show you the next house, the next house, the next house, the next house, instead of searching locally, exhaustively, you search in different neighborhoods and you say, okay, this is one house in this neighborhood. Are you interested in this? This is another house, this is another house. So you find...
00:32:25
Speaker
the global region that you're interested in, and then you dig down deeper within that global region. So you do a coarse search and then a more fine-grained search within that neighborhood, exactly. So the highest hierarchy here is going to perhaps show you, using your analogy, one house from every neighborhood.
00:32:46
Speaker
One of my university lecturers would be proud that I can now see how this starts to become log N. Because you're building a hierarchy that gets gradually more and more detailed. Is it multi-layered?
00:32:59
Speaker
Yes, so you start off at the highest level and then you go down levels, and you can have 15 levels, five levels, however many you want, but the highest level has an exponentially low number of data points. So you can almost think of starting off at the bottom level, and that level has all of your data points:
00:33:21
Speaker
every vector. Let's say you have a hundred million vectors: all of the vectors exist at the bottom level. And then as you go up one level, these data points start to drop off. So by the time you reach the highest level, you only have an exponentially decaying number of data points; they've only survived with a certain probability up there. So you have a very small number of data points to search over. And then as you drill down, you get more and more vectors.
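A heavily simplified sketch of the greedy move at the heart of this kind of search. Real HNSW stacks several proximity graphs (sparse at the top, dense at the bottom) and starts each layer's walk where the previous layer's walk finished; this toy shows just one such walk, under those stated assumptions.

```python
import numpy as np

# Toy greedy walk over a proximity graph: hop to whichever graph neighbour
# is closer to the query, and stop at a local minimum.
def greedy_walk(query, vectors, neighbours, start):
    """neighbours[i] lists the graph neighbours of vector i."""
    current = start
    while True:
        best = min([current, *neighbours[current]],
                   key=lambda i: np.linalg.norm(vectors[i] - query))
        if best == current:   # no neighbour is closer: local minimum reached
            return current
        current = best
```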
00:33:49
Speaker
Yeah, it's the exponential decay on the way up that gives you a logarithmic search on the way down. Yeah, exactly. Okay. Does that mean this is expensive to create the index? Are you then having to... Are you doing that search to pre-calculate all the clusters? Or is it smarter than that?
00:34:08
Speaker
No, so it is a lot more intensive to create the index than it is to search the index. Because searching the index is just logarithm of n. Because you come in with a query vector, and then you pick a random vector at your highest level, and you make these big jumps on which neighborhood are you interested in. And then depending on this neighborhood, let me show you more locality, more locality. You keep on adding vectors, and then you keep performing searches
00:34:37
Speaker
all the way down to the lowest level, where now you've got your nearest neighbors, and you return that. But building this index up is a lot more difficult. So usually, building an index can take anywhere up to hours, whereas with searching, you can perform thousands of searches per second. Okay, let me just try and get a sense of this. So
00:35:02
Speaker
if I've got a large index pre-calculated and I want to add something in, how expensive is that if I add one new document? So adding is relatively quick. So this is one of the plus points of the HNSW algorithm, and this is one of the reasons why Weaviate uses the HNSW algorithm. It's one of the few approximate nearest neighbors algorithms that not only is quick, is log of N,
00:35:30
Speaker
But it also supports insertion and deletion because every database needs CRUD operations. You need to be able to add data points and not have to reconstruct the entire index because then every update or every insertion would be hours long. Yeah. So you can simply add a data point.
00:35:48
Speaker
and that data point gets added at the bottom level because every data point exists at the bottom level. And then again, you have that exponentially decaying probability of whether or not that data point survives at the next level, at the next level, at the next level. So at some point that vector object is going to die off and it's only going to reach up to a certain level into your hierarchy, into your index. So that's all insertion takes.
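The survival rule is easy to sketch. The probability value here is an illustrative assumption, not Weaviate's actual default:

```python
import random

# Sketch of HNSW-style level assignment at insert time: every new vector goes
# on the bottom layer, and "survives" to each successive layer above with a
# fixed probability, producing the exponentially thinning hierarchy described.
def assign_level(p: float = 0.25) -> int:
    level = 0
    while random.random() < p:   # survive to the next layer up with prob. p
        level += 1
    return level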
00:36:11
Speaker
There's one other small detail which I'm going to ask you about. As there's less and less probability of a point surviving, are you saying at the top level we've maybe got, let's say, a dozen surviving points? Are they actual documents, and not, like, averaged centroids?
00:36:29
Speaker
No, no, they're actual documents. Okay. So we can talk about this, but there are algorithms that allow you to approximate vectors using centroids, so using product quantization or using k-means clustering. But just HNSW by itself doesn't average the vectors out. It uses the actual vectors to search over.
00:36:50
Speaker
Okay. I wonder if that means, if your index got sufficiently large, you could shortcut; you wouldn't have to go all the way down in the search. But I guess you're saying it's fast enough anyway. Yeah, because you have this log N... you mean cut it off at a certain point, before it reaches the bottom? Yeah, you get five levels down, and you say, well, that's good enough for now.
00:37:17
Speaker
Yeah, I think you could do that, but I'm not too sure. I guess you could, but you wouldn't need to bother. Yeah, because it's already real time. So even if you have hundreds of millions of documents, you can search hundreds of queries per second. So unless you're trying to push recall to a degree where now you're kind of falling back onto
00:37:41
Speaker
brute force search, then I think you would have to say, okay, maybe we don't go down to the bottom level. But then again, you would have the problem that recall would be hindered as a result of that, right? Okay. Give me an idea of the size of this index once you've built it. Because there's got to be huge compression from the original documents down to the vectors, but then is the index
00:38:10
Speaker
presumably the index is larger than the vector set, because it's got multiple copies at different hierarchies. Yeah. So, if you just look at one vector... if you have 10,000 documents, you'll have 10,000 vectors, but then the index has... yeah, this is a good question. I'm not sure if multiple copies of the vectors are actually stored, or you just say that this vector is also at this level and refer to it. But it also depends on how many vectors you have.
00:38:41
Speaker
And so, for example, when we tested this with the Sphere dataset, which has around 900 million objects, when we vectorized that and we created the index for it, I believe we had thousands of gigabytes. So the index was thousands, or...
00:39:02
Speaker
hundreds of gigabytes large. I can't remember the exact numbers, but we've written a blog on this that people can refer to. So the index was that large. And this is where the compression algorithms that I was talking about come into place because the vector database runs in memory. And so if your data set gets very large, it can get very expensive.
00:39:24
Speaker
So that raises the question, this is all on memory, but how is this stored on disk? Storing lots and lots of lists of floating point numbers, is this some kind of column thing on disk?
00:39:40
Speaker
So on disk... well, all the vectors, when they're indexed and you're searching over them, this is happening in memory. It's all in memory? It's all in memory. There are some algorithms which propose doing read-writes from disk, and in order to compress and reduce memory usage, that's what you would have to do. So we've recently announced this, where you can keep in memory
00:40:10
Speaker
compressed centroid representations of vectors, and then on disk, you store the full representation. So what you do is you do this coarse search over k-means centroid vectors, and then you say, these are my 100 closest vectors that I'm interested in. Now read them all in from disk and perform the finer search over the 100 correct vectors.
00:40:35
Speaker
So that's one way to balance between storing the whole index in memory versus storing a compressed version of the index in memory, using that to identify which centroids I'm interested in and then reading those vectors in and then re-performing or re-scoring those distances so you get back the recall.
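A sketch of that two-stage scheme, with plain k-means centroids standing in for the compressed in-memory representation. Real systems use schemes like product quantization and actual disk reads; this is a toy under those stated assumptions.

```python
import numpy as np

# Two-stage search: coarse ranking against compressed (centroid)
# representations held in memory, then exact re-scoring over the shortlist
# of full vectors (which a real system would fetch from disk).
def two_stage_search(query, centroids, assignments, full_vectors,
                     k=5, shortlist=100):
    # Stage 1: each vector is represented by its centroid.
    coarse = np.linalg.norm(centroids[assignments] - query, axis=1)
    candidates = np.argsort(coarse)[:shortlist]
    # Stage 2: exact distances over the candidates' full vectors.
    exact = np.linalg.norm(full_vectors[candidates] - query, axis=1)
    return candidates[np.argsort(exact)[:k]]
```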
00:40:56
Speaker
Yeah. And presumably there'd be hot pages in that, so you could keep the most important things still in memory. Yeah. Okay. So let's move on to another topic, which you've hinted at. One of the early things I remember seeing in this space was you can do cool things like, you take the vector for... very broadly, I'm hand-waving, you can fill in the details, but you take the vector for bridges.
00:41:24
Speaker
And then you take the vector for bridges in Germany and you take the vector for Germany and you subtract Germany from bridges in Germany and add in Mexico and you get documents about bridges in Mexico. And you've been doing this with like going from images to text to sound files. Tell me about that. Yeah. So this is this is a field that I'm really interested in, this idea of multimodality, because
00:41:55
Speaker
A lot of people are doing vector search over text documents right now, but I think the future is doing this vector search over multimodal data. Because the idea is that there's no reason why we should limit these neural networks to just text data. And in fact, before we had this revolution in natural language processing, we had a revolution in computer vision. The original paper that was published by Hinton in 2012 was on ImageNet.
00:42:23
Speaker
Then he used NVIDIA GPUs to train this AlexNet model that got state-of-the-art on solving ImageNet.
00:42:38
Speaker
These models can understand not just text, not just sentences, not just documents, but also images, video, which are just evolutions of images. They can also understand audio files. They can understand all sorts of data. So why limit the vector representation to just text? We're more comfortable with text, but I think now that people are getting used to vector search, I think the next thing that is going to be very popular is
00:43:08
Speaker
storing images in the form of vectors, storing videos, storing audio files in the form of vectors. And because the vector database, all it's doing is searching over vectors, it doesn't care whether the input was an audio, image, video. A vector is a vector. And I can just as easily build an HNSW index over a vector of audio files as I can a vector of text documents.
00:43:37
Speaker
So

Multimodal Models and Their Potential

00:43:37
Speaker
now the idea is that now that if we have models that can represent this multimodal data as vectors, we can search over them. It gets really cool when you have models that can capture not just one modality. So you might have a model that understands text and a separate model that understands audio, which is interesting because they allow you to perform, you know, text to text search, audio to audio search individually within modality.
00:44:07
Speaker
But what gets me really excited is when you have models that can understand multiple modalities, so a model that could potentially understand images and text, like CLIP from OpenAI, for example. Where now you can search for a concept like "cat" and you get back images of a cat.
00:44:28
Speaker
And there, you're not really matching words with images in terms of metadata, but rather you're matching the vector of the word with the vector of the image. And those two vectors happen to have a lower distance between them, and that's why they get retrieved as nearest neighbors.
00:44:47
Speaker
Right, so they both get indexed into roughly the same space. And presumably if you index lots of these different things, any text query would land you in a place that was surrounded by documents and images, waveforms, videos. But you said earlier it's like,
00:45:08
Speaker
one model will encode things into English and another model encodes them into German, and you can't just mix and match vectors. Yeah. So do I need a neural net that's been trained to do all these things? Yeah, so that's a great question. Right now there are two kinds of
00:45:25
Speaker
developments in this field of multimodal neural network modeling. One development is more practical, and this goes in line with the ones that we've integrated so far. So if you look at CLIP from OpenAI, or if you look at ImageBind from Meta, which was released earlier this year, these types of multimodal models... so CLIP, for example, understands images and text.
00:45:50
Speaker
And ImageBind understands text, images, audio, video. And there's a couple of other modalities which are not as interesting, but those are the four main modalities that it understands. This model, ImageBind, is actually six independent models. So there's one model that's a specialist at identifying images, another one that's a specialist at identifying words. OK.
00:46:18
Speaker
And so you've got six separate models. And now the problem is what you said: each one of these models speaks its own vector language. So if I take the image of a cat, I'm going to get a barcode, and that barcode can be very different from the barcode of the word cat itself.
00:46:36
Speaker
Now the trick is: how do I unify the vector spaces such that the barcodes for the image of a cat, the word cat, maybe the video of a cat, and the cat meowing are all landing in the same kind of approximate vector space? And that is the trick.
00:47:01
Speaker
Yeah, so what they do in the ImageBind paper is they use contrastive learning. So they say, here is one data point... and they do this across modalities. So let's talk a little bit about within-modality contrastive learning first. For example, if I take five images of dogs, and then I have five images of other classes, what I can do is pass these through the same model.
00:47:27
Speaker
And I get vector representations for five dogs and five other classes. So now I can say in my vector space, I'd like the data point for dog to be sufficiently close to the data points for other dogs, but far away from the negative examples, far away from the non-dog data points.
00:47:46
Speaker
So through this contrastive training, I can actually push and pull data points in vector space. So once I've generated them, then I can kind of push and pull to my liking, based on these positive and negative examples. Right. So, hang on. How does that actually translate into vectors? And how does that help?
00:48:11
Speaker
Yeah, so for example, let's say you have a model that understands images. You're going to pass your images through and the model generates the vectors. So one way that you can train this model is just to classify. Let's say you have a model that takes in images and its job is to output a probability per class.
00:48:32
Speaker
And you train this classifier and then you take some slice, some representation of weights in between before it gets to the classification part. And that becomes your vector representation. Okay. So now I'm sort of doing a centroid for dogs. Is that the solution?
00:48:51
Speaker
Yes, so the idea is that if it's trained to correctly classify, to distinguish between the different classes, it has to be able to identify what's unique about dogs, what's different about a dog versus a cat versus a cake.
00:49:10
Speaker
But it has to nudge the weights such that it can identify dogs from cakes from cats. And in doing so, it identifies or it localizes different parts of your training set in different parts of vector space. And then you take that part of the vector space and you say, now these are my embeddings. This is my vector representation for what a dog is. And I'm going to use that with my vector database.
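A toy sketch of that idea in PyTorch; the architecture is illustrative, not a model from the episode. The classifier head is used only during training, and the penultimate activations become the embeddings.

```python
import torch
from torch import nn

# Train an image classifier, then reuse its penultimate activations as the
# embedding that goes into the vector database.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 512), nn.ReLU(),
                         nn.Linear(512, 128), nn.ReLU())
classifier = nn.Linear(128, 10)         # trained jointly with the backbone

images = torch.randn(4, 1, 28, 28)      # a toy batch of images
embeddings = backbone(images)           # 128-d vectors for the vector DB
logits = classifier(embeddings)         # only needed while training
```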
00:49:34
Speaker
And so if your model can give you this, you can then go forward with this idea of contrastive learning that I talked about. OK. So I'm still not entirely getting how that takes me between different modalities. Yeah. So that's the trick where, let's say you have a model that you've trained independently to identify visual features. It can take in images. You've got another one that understands text and another one that understands audio.
00:50:03
Speaker
If these are all generating vectors of the same dimensionality, or you've kind of modified them to generate vectors of the same dimensionality, now you're getting an image representation of 500 dimensions, a word representation of 500 dimensions. Now what you want to do is, say you take the image representation of a dog and you take its vector. So you get 500 numbers, you get that barcode. What you want to do is then
00:50:28
Speaker
train it further to say this vector representation of the image should be close to the vector representation of the word dog, of a dog barking, of the video of the dog. So you pull together the vectors across modalities in this multidimensional vector space and you push apart the concepts that shouldn't be close together.
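A toy version of that cross-modal contrastive objective, in the style of CLIP's loss: matched image/text pairs in a batch are pulled together, and mismatched pairs are pushed apart. A sketch, not the exact ImageBind training code.

```python
import torch
import torch.nn.functional as F

# CLIP-style contrastive loss: row i of img_vecs is the matching pair of
# row i of txt_vecs; every other pairing in the batch is a negative.
def contrastive_loss(img_vecs, txt_vecs, temperature=0.07):
    img = F.normalize(img_vecs, dim=1)
    txt = F.normalize(txt_vecs, dim=1)
    logits = img @ txt.T / temperature      # pairwise cosine similarities
    targets = torch.arange(len(img))        # the diagonal pairs match
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```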
00:50:50
Speaker
Right, that's the part I'm missing. When you get those five dogs, there are actually five different modalities of dogs. Exactly. So now you've got a concept. You know how babies, when you explain a new concept to them, they're not just taking in that concept like a machine learning model would.
00:51:09
Speaker
they understand the concept across all these modalities. So when you explain to them, that's a dog, they see the dog, they can hear the dog, they can kind of see the fur wave, so they understand the motion dynamics of what happens when the air interacts with the dog. So they understand all of this. They might even get smell.
00:51:27
Speaker
Yeah. So now you've got this multi-dimensional vector space where you can pass in a query and then you can retrieve the nearest neighbors across multiple modalities. And so that's what this image bind model is doing. So you're kind of forming a gestalt of all the different vector types. Exactly. Yeah. Okay.
00:51:48
Speaker
And so this is one approach, and this is the practical approach. So the last module that we released with Weaviate was the ImageBind one, a multi-to-vec ImageBind module, where you can take any type of multimedia that the model understands, it translates from that multimedia type to a vector, and then you can search within this vector space.
00:52:08
Speaker
And you can do some really cool things with this. But the other approach here that I haven't seen work as well, I don't know if it's as advanced as this approach of training one model per data type and then fusing them, unifying the vector spaces, is to, like you said, take a model and just train it from scratch to understand all of these modalities. So there's no different model. It's just one model that understands everything.
00:52:38
Speaker
and it has to learn to differentiate and then group itself. That seems like it would be a lot more difficult.
00:52:45
Speaker
It is a lot more difficult, but it is a lot more scalable because now, think about what would happen earlier. Let's say you want to encode 25 modalities, you have 25 models that you're training, and then you have to fuse them. How do you fuse them? Well, you need to hand curate. These are positive examples, these are negative examples. Pull these closer to this representation, push these concepts away.
00:53:09
Speaker
It doesn't scale. Whereas this one, because one model understands it, you don't need to unify. The model itself in the optimization process unifies for you. But it is a more difficult training process for sure, which is why I don't think we have practical implementations of these second types of multimodal models. There are proposals, but I haven't seen one that works as well as this image bind model from Meta.
00:53:35
Speaker
So this is slightly ahead of the cutting edge. That's the future you're hoping for. Exactly. But I think this is the future, this type of one model that understands all of these modalities, if not all the important modalities like audio, video, image, and text. I think if we can get a model that understands these types of modalities, I think they will function better. But for now, we have this kind of duct tape solution of six different models, unify them so that they each speak the same language, and then now you can perform
00:54:03
Speaker
Very interesting multimodal search, cross-modal search. Given words, you can search for audio files, video files, images. Super cool. Okay. That does sound very cool. So people watching us on YouTube, we'll be able to see there's a lovely painting behind you. And we have a future where I can search for...
00:54:25
Speaker
medieval poetry that's closest to that painting. Exactly. Well, not a future. I mean, you can do it now. So we have a we have a module with Weavie and we've we've actually created a demo of this where we see the database with audio files, with video files, images and.
00:54:45
Speaker
then we can search over it with any of these modalities. So you could say, today, I'm feeling like this song. That could be your query. And you say, here are some paintings that are close to that query, or here are some videos that are close to that query. And you can do that right now. This kind of cross-modal search is completely possible.
00:55:04
Speaker
That seems like the most mad fun I could have since Markov chains mashing together heavy metal lyrics and Alice in Wonderland. Yeah, it's amazing. We're actually having multiple hackathons around this concept of multimodality because once you give people this machine learning model that can understand all these types of data and the ability to scale up to
00:55:28
Speaker
millions or hundreds of millions of these data types, then really your imagination is the limiting factor. What can you put together? Are you going to search for audio over audio, like a Shazam type of application? You can do all sorts of interesting things with this. Okay, this sounds like fun. So to wrap up then, why don't you give me the very high-level overview of how I could do this at home?
00:55:53
Speaker
And I'll allow you a quick plug of Weaviate for this. All my demos are open source. So if you go to Weaviate Tutorials, that repository has a bunch of these implemented. If you go to Weaviate Examples, there's a bunch of implementations of this as well, starting off with basics of how do you search over text documents,
00:56:18
Speaker
all the way up to how do I search over these multimodal documents as well. So you can look at examples there. Also, we're implementing newer and newer modules pretty much every month. So if we see a really new, cool model that we think would be helpful in vectorizing data, like this multimodal image bind one from Meta, we integrate it so that if you take your data, you can simply just point us towards that data.
00:56:45
Speaker
You choose which module and which model we're going to use to vectorize it, and then we store all the vectors. And then you can come along and say, this is a new file, this is an image or an audio file. We handle all of the vectorization: all of the going from data point or data object, to passing it through the model, generating the vector, and then searching over it.
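As a hedged sketch, a semantic query through the older, v3-style Weaviate Python client looks roughly like this. Method names differ between client versions (a newer client is mentioned below), and the class and property names here are illustrative assumptions.

```python
import weaviate

# Connect to a locally running Weaviate instance (v3-style client).
client = weaviate.Client("http://localhost:8080")

# Ask for the 5 objects semantically closest to a natural-language query.
result = (
    client.query
    .get("Document", ["content"])
    .with_near_text({"concepts": ["condiments that go well with a hamburger"]})
    .with_limit(5)
    .do()
)
print(result)
```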
00:57:06
Speaker
So in terms of how many lines of code, it's relatively simple. I think it's only 50 lines of code to create this multimodal search engine, and there's an example of this that you can check out as well. Outside of this, if you're interested, I would join the Slack community. We've got a forum as well, if you have questions when you're playing around with this. So yeah, that's where I would look.
00:57:33
Speaker
We also love getting feedback from the community. So we've recently announced a new Python client. So if you do try to make this multimodal search engine, I would recommend you use the new Python client. And if you have feedback, tell us about it. If you think that we can improve things, definitely reach out with that. And yeah, we'd love to see what people build.
00:57:55
Speaker
Cool. I'll get some links from you offline and we'll stick those in the show notes. Yeah, for sure. And so the last question I wanted to ask you, really brass tacks, is Weaviate. It's one of these models where it's an open source database, plus you can pay us to host it for you. Exactly. That's the very high-level view. Which parts are open source, and what's the license? Just so I know.
00:58:20
Speaker
Yeah, so it's fully open source, and the same source code that paid customers get is the exact same code that you can access on GitHub for Weaviate.
00:58:36
Speaker
The base model, the base code is exactly the same. If you deploy it locally on your computer, it's exactly the same. If you deploy it into your cloud, it's exactly the same. Essentially, what we charge for is managed instances of Weaviate. So if you deploy Weaviate on WCS, Weaviate Cloud Services, then we charge for that.
00:58:57
Speaker
or if we manage your database in your own cloud environment, then we charge for that. So that's what we're charging for, but the code itself is exactly the same. So if you have the capabilities to run the database, then you get the exact same functionality and performance. And also we're improving it daily. So it's not that
00:59:16
Speaker
one version's performance is better than the other. There is only one version of Weaviate; the free and the paid version are exactly the same. Cool. I like that business model. It's like, you can use exactly the same thing at home until it becomes a headache to manage, and then we'll do it for you at a price. Yeah, that's fine. Exactly. Okay. Well, thank you very much. I finally feel like I have an understanding of how the computer science under the hood works.
00:59:39
Speaker
Awesome. Yeah. So this was very interesting because as a data scientist, I understood the k-nearest neighbors and the brute force approach. And then I was wondering myself how this would scale up. And so I did a deep dive. It's been almost a year since I joined the company. But it was very interesting to get an engineering understanding of how this is implemented, which is very new to me. So all of the database implementations and how you do the approximate nearest neighbors, learning all of that was very fun coming from a data science background.
01:00:08
Speaker
Yeah, the technology is moving so fast that not only is the machine learning component moving fast, but the engineering stuff is also moving fast. So we're pushing out updates monthly around upgrading the performance, upgrading the features that you have within the database as well. And then the machine learning world is moving so fast that we're also integrating multiple modules so that you can represent data better and better in these vector formats as well.
01:00:37
Speaker
Well, with the world moving so fast, I should probably leave you to get back to keeping up. Awesome. Zain, thank you very much for joining us. It was great fun. Thank you, Kris. Zain, thank you very much.

Episode Wrap-Up and Exploration Encouragement

01:00:49
Speaker
I've been playing with those ideas for a few months now, but that's the first time I got a sense that I can trace the bytes from the keyboard through the data structures all the way out into the screen again. That sense that you can see how the data is processed every step.
01:01:06
Speaker
And I love that feeling. I love getting a new architecture inside my head, feeling a certain sense of mastery over it, you know? Hopefully, listening at home, you've had a similar sense of revelatory clarity. I certainly hope you have. If you have, I'll just simply take a moment to remind you that the like and rate and share buttons are there waiting for you in your app.
01:01:33
Speaker
And we will be back again next week with another look into the world of programming through another developer's voice. So you might want to click subscribe and notify to make sure you catch it. And until next week, I've been your host, Kris Jenkins. This has been Developer Voices with Zain Hasan. Thanks for listening.