
The Open Source AI Revolution Begins Now...

Developer Voices

LLMs like ChatGPT are not just fascinating, they're becoming increasingly useful in our working lives. They've graduated from novelty to valuable tool. But building those tools is still in the hands of huge companies. Or is it?

In this week's episode of Developer Voices, we're learning how you can run LLMs on your own laptop, and how you can customize the system to make a tailored research assistant, a better documentation-searcher, and much more. All you need is a guide on which pieces you need, and how they fit together, and that's exactly what this week's guest—Tobi Fankhänel—is here to take us through.

A leaked memo from Google recently outlined how the Big Company Advantage has almost completely eroded, and how the next wave of LLM development is going to come from the open source community. So hackers rise up - the open source AI revolution begins now!

--

Kris on Twitter: https://twitter.com/krisajenkins
Kris on LinkedIn: https://www.linkedin.com/in/krisjenkins/
Tobias on LinkedIn: https://www.linkedin.com/in/tobias-fankh%C3%A4nel-749712180/
Tobias’ blog: https://blog.exxample.eu
LangChain: https://python.langchain.com/docs/get_started/introduction.html
Embeddings: https://weaviate.io/blog/vector-embeddings-explained
Vector Databases: https://en.wikipedia.org/wiki/Vector_database
"We have no moat" – Google Employee on Open-source LLMs: https://www.semianalysis.com/p/google-we-have-no-moat-and-neither
“Attention is all you need” - https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
Timeline since Meta open-sourced their first-gen models: https://www.semianalysis.com/i/119223672/the-timeline
Run LLMs on CPU only or, since May, mix CPU and GPU usage: https://github.com/abetlen/llama-cpp-python
Samantha: https://erichartford.com/meet-samantha
Embedding model leaderboards: https://huggingface.co/spaces/mteb/leaderboard
Open-source LLMs: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
LLaMA: https://ai.facebook.com/blog/large-language-model-llama-meta-ai/
Blog post: Design-pattern ‘In-context learning’ https://a16z.com/2023/06/20/emerging-architectures-for-llm-applications/#section--2
Tobi's GitHub branch ‘In-context learning with LangChain’ https://github.com/aviav/turmbauten/blob/spaghetti-code/CHANGELOG.md
Prompt Syntax Cheat Sheet: https://github.com/oobabooga/text-generation-webui/tree/main/characters/instruction-following
Google Workspace Labs Sign-Up: https://workspace.google.com/labs-sign-up/
GMail Workspace Labs Demo Video, click ‘See it in action’: https://workspace.google.com/solutions/ai/#m10
Prediction trading on open-source LLMs vs GPT-4: https://manifold.markets/PeterWildeford/will-i-peter-wildeford-think-that-t-c95ff3c1b385

Transcript

Dismissal to Adoption of ChatGPT

00:00:00
Speaker
So I have to confess something to you. When all the noise started to kick off about ChatGPT, I kind of dismissed it as hype. You know, it was fun, sure it was a lot of fun, but it was another AI toy like those image generators that was here today but would drop off the radar in six months. But then I couldn't help but notice how many of my friends kept talking about how they were using it.
00:00:27
Speaker
Not how exciting it was, not how cool it was, not how it was going to change the world, oh my god, we're all going to the moon, hooray. They were actually talking about being users. How it was helping them to get stuff done day to day.
00:00:42
Speaker
And that's when I started to pay more attention. And wouldn't you know, yeah, I've become a day-to-day user myself. I'll tell you the number one thing I use it for, it rewrites my LinkedIn posts. It's just way better at that kind of corporate-y tone than I am. So yeah, I get AI to leverage my synergies these days.

Personalized AI for Developers

00:01:02
Speaker
So while I've started to see it as an actual useful tool, until recently I assumed it was a tool that was owned by people with access to supercomputers, or at least massive AWS cluster budgets.
00:01:17
Speaker
Not so. I recently got talking to Tobi Fankhänel and he's opened my eyes. This stuff is now firmly going into the hands of regular developers. We're about to see personalised AI on our desktops because we're about to have the tools to hack them together ourselves.
00:01:36
Speaker
And I think it probably won't be long before, you know, when you check out a new programming language, you end up installing a language extension that ships with syntax highlighting and LSP and debugger support and a custom AI model that's been trained specifically for the language by the extension's developer.

OpenAI Misconceptions and Accessibility

00:01:56
Speaker
I can see it coming. And I think by the end of this episode, you'll be able to see it coming too. And if it excites you, as I hope it will, Tobi's got some explicit tips on how to get started building this stuff. So let's figure out how to put AI in our hands. I'm your host, Kris Jenkins. This is Developer Voices. And today's voice is Tobi Fankhänel.
00:02:29
Speaker
Joining me today is Tobi Fankhänel. Tobi, how are you doing? Oh, pretty great. Thanks. How are you? I'm good. I'm glad to see you. We saw each other in person last week, which is a rare treat for me as a podcast host. And the week before. Yeah. Yeah, we did actually. You were in England the week before and I was in Germany the week after. So who knows what we'll do next week.
00:02:53
Speaker
But one of the things we got talking about when we first met was OpenAI, right? I've always thought of OpenAI as this enormous, very clever database of billions of whatevers that is the state of the art and completely out of our hands and this tool we use. And then you mentioned to me this blog post from someone at Google. We Have No Moat, I think it was called.
00:03:22
Speaker
And it basically said, to paraphrase, oh God, we're being hosed by the open source world.
00:03:30
Speaker
And this is what I want you to tell me about. How can we run our own AIs? The open source side is sufficiently rich that we can get involved. Yes. So the state of the technology today is that it's possible even to run LLMs on hardware that doesn't even have a GPU. That's how far it has progressed.
00:03:56
Speaker
First, a note of caution: that memo was written by a Google employee, and it was said to be a leak, so maybe it was not meant to be published. Also, it's a solitary opinion. But still, it gained lots of media coverage at the time it was published.

Open-Source LLMs and Hardware

00:04:12
Speaker
Meta had recently released some of their first generation large language models, the LLaMA models. I think it's spelled like the animal, only with a slightly different capitalization, and "large language model"...
00:04:31
Speaker
Yeah, something like that. I don't even know. Wikipedia would know. And yeah, of course, Meta is a competitor to Google in some ways. But what they had done is they open sourced the models, they published them under an open source license. And then, if I'm not mistaken, they only limited the access to the model weights, which you needed
00:04:54
Speaker
to really use the models, and they provided researchers access to that according to a waiting list. And then after a while, like two weeks I think after they started in the beginning of March this year, the weights also...
00:05:12
Speaker
Well, it's said they leaked. People could download them via torrent at first, and then the whole thing really took off, because suddenly there were reasonably capable models, not at the level of GPT-4 by far, but
00:05:29
Speaker
capable. And some of them could be run on consumer grade hardware from the start, and so everybody could suddenly get into LLM research without needing a cluster, without needing some membership in a research group or some access that was limited up to that point. And it seems many people were eager to get their hands on that kind of technology.
00:05:55
Speaker
And so within the span of a few months, since March, multiple things have happened.

AI Models and Mechanisms

00:06:01
Speaker
So let me just check I've got this right first. So the model is like, how do we configure this big, clever neural network? And then the weights are like, once you've configured it, you have to run vast amounts of data through it to train it. And you end up with the magic multipliers that set the weight of the neural net, right? So you need those two pieces to do anything interesting.
00:06:25
Speaker
Yeah, I think so. I need to be careful here because I've done lots of things with those llama-based models at home, but the thing I haven't done is to try to fine-tune it or to try to train a model myself from the ground up.
00:06:43
Speaker
I might just be saying nonsense here, but the model is something that needs to be trained. And then what happens there, the state of the art in terms of LLMs, it uses a transformer based architecture. And it goes back to a paper from, I think, 2016 or something, that is titled
00:07:07
Speaker
"Attention Is All You Need". Basically, it introduces an attention mechanism into the way that models are trained. Then it's possible with an existing model to fine-tune it. I think what you receive there...
00:07:27
Speaker
That's probably the way it is, but better look it up. As I say, I'm going to get into that at some point to work on it myself. But right now, the simple use cases, they are interesting enough when it comes to applications. And so I'm just retelling basically what happened. There started to be multiple things that made it easier for people to run their models on more and more hardware,
00:07:52
Speaker
and lower and lower spec hardware. And so at first there was a way to kind of, if I understand it right, modularize the fine-tunings, called QLoRA. And that was even before I got into working with those models. And by now, what has happened is the models have been re-quantized, firstly. So quantization, maybe I'll quickly say something about that, I think.
00:08:22
Speaker
The original LLaMA models, they were quantized with 16-bit numbers, as far as I understand it. You are dealing with vector spaces which have a few hundred or a few thousand dimensions in current generation large language models.
00:08:39
Speaker
And then words are represented, or tokens, which might be a bit shorter than words or sometimes might be multiple words in a few cases. They are represented as vectors, and those vectors consist of numbers. Like, in a five hundred dimensional vector space you would have
00:08:57
Speaker
one token, and it is assigned a vector which has 500 numbers. Those numbers need to be multiplied a lot, those vectors multiplied with matrices, to do the whole processing of a large language model. Basically, that's something that GPUs are specialized to do. They perform well on this.
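To make the vector picture above concrete, here is a toy sketch in Python with NumPy. The vocabulary size and dimensions are invented for illustration and are not taken from any particular model.

```python
import numpy as np

# Toy illustration: a vocabulary of 32,000 tokens, each mapped to a 500-dimensional vector.
vocab_size, dim = 32_000, 500
embedding_table = np.random.rand(vocab_size, dim).astype(np.float16)

token_id = 1234                     # one token of the input text
vector = embedding_table[token_id]  # its 500-number representation

# The bulk of LLM inference is multiplying such vectors with large weight matrices,
# which is exactly the kind of work GPUs are built to do quickly.
weight_matrix = np.random.rand(dim, dim).astype(np.float16)
hidden = vector @ weight_matrix
print(hidden.shape)  # (500,)
```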
00:09:19
Speaker
Essentially, you're downloading a giant matrix database of 16-bit floating point numbers. Yeah, that's the way I imagined it. The quantization then, what happens there is initially you have 16-bit precision numbers, floating point numbers.
00:09:38
Speaker
It turns out if you just decrease the exactitude, I don't know how we would call it in English, the precision of the numbers to 8-bit or nowadays even to 4-bit and then people are going even lower than that.
00:09:57
Speaker
the LLMs still perform and deliver useful output. And that of course means it makes a huge difference if you're multiplying a sixteen bit number with a sixteen bit number, or if you're multiplying like a four bit number with a four bit number. And then
00:10:13
Speaker
what you need to save in terms of what needs to be stored in memory, what you need in terms of storage, then how often the memory needs to be swapped in and out. And then, of course, what kinds of hardware are capable of running these models. And so this is a big part of the recent progress that has been happening, just decreasing the quantization and by that enabling more and more people to run those things on their own hardware.
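A rough back-of-the-envelope calculation shows why the bit width matters so much for memory. This is a sketch, not a real memory model: it only counts the weights and ignores activations, the KV cache, and other overhead.

```python
# Approximate size of the weights alone for a 65-billion-parameter model
# at the quantization levels mentioned above.
params = 65e9

for bits in (16, 8, 4):
    gigabytes = params * bits / 8 / 1e9
    print(f"{bits}-bit weights: roughly {gigabytes:.1f} GB")

# 16-bit weights: roughly 130.0 GB
# 8-bit weights: roughly 65.0 GB
# 4-bit weights: roughly 32.5 GB
```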
00:10:43
Speaker
So people have been like down sampling these databases so you can run it on more commodity hardware and see if it still gives you a good enough result. Yes. And some of the research in the academic field that's going on is also dealing with this and testing and comparing according to various benchmarks. If I decrease the quantization further, or if I try to become a bit smarter about
00:11:08
Speaker
the whole vector space and maybe I'm a bit more exact in some areas and a bit less exact in other areas.

Vector Databases and Embeddings

00:11:16
Speaker
How can I get close to the 16-bit original model's output quality with much, much lower resource usage?
00:11:27
Speaker
Yeah, you would think as soon as you've published a useful database of 16-bit floating point matrices, someone who knows about compression will get involved, right? Yeah, yeah. Many people got involved and some really specialized in that. There are some people who are regularly re-quantizing new models or fine-tunings of the original LLaMA models as they appear.
00:11:52
Speaker
And by now it's not only the LLaMA models that are there. And since, I think, the middle of May, there's this library that's called llama.cpp, which is based on C++. It's been around a bit longer than May, but what happened in May is it started to offer a way to run the models in a mixed mode between CPU and GPU, and people can just
00:12:20
Speaker
set a config: I want to have this number of layers which are processed by the GPU and are living in the video RAM, and all the rest can live in just the normal, what is called CPU RAM in this context, just the normal RAM of the computer. And then if you have a model that is maybe a size of
00:12:41
Speaker
60 gigabytes, and you have a graphics card that maybe has 10 or 12 gigabytes of VRAM, or 24, which is pretty common with top-of-the-line gaming GPUs and is fairly widespread, then you can just say, okay, I want to have a part of it running in my GPU, living in my GPU. And then if it's maybe just a bit too big for the GPU RAM, just a few layers need to be computed by the CPU, and then you already get very good performance, and it's very flexible.
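As an illustration of the mixed CPU/GPU mode Tobi describes, here is a minimal sketch using the llama-cpp-python package linked in the show notes. The model file name and the number of offloaded layers are placeholders; pick them to fit your own model and VRAM.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-30b.q4_0.bin",  # placeholder: a quantized model file you downloaded
    n_gpu_layers=40,  # layers offloaded to the GPU and its VRAM; the rest run on the CPU from normal RAM
    n_ctx=2048,       # context window size
)

output = llm("Q: What is a vector database? A:", max_tokens=128)
print(output["choices"][0]["text"])
```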
00:13:09
Speaker
OK, yeah, because that kind of neural network topology does lend itself to having multiple stages in different places. Yes. But OK, so here I am. I recently bought myself a new MacBook and it has way more GPUs than I think I need. So if I want to put that to work on AI, I want you to tell me how to do it. But first, what kind of results can I expect on my little laptop? It depends on the specs, I would say. So
00:13:39
Speaker
I mean, nowadays, basically, I think if you just have 64 gigabytes of RAM, normal RAM in that notebook, then you can run all of the LLaMA-based models in some quantizations like 4-bit or 5-bit, or I think even
00:14:01
Speaker
Well, with 65 billion parameters, that's the biggest LLaMA model that has been released. I'm not quite sure, but I think even that should fit. The thing is, the more that the CPU gets involved, the performance in terms of token output, token frequency, decreases. But the quality of the output does not decrease. Yeah, so it doesn't matter. It's just the speed of the answer. Yes, exactly.
00:14:27
Speaker
You would think, logically: we've spent all these years building these dedicated circuits for multiplying matrices so they can display polygons on the screen. It's nice to know there's another useful reason for them, these specialized matrix multipliers.
00:14:46
Speaker
So what does this actually translate to? You're going to tell me how I can run my own local version of an API, because one thing I would like to do with something like this... the problem with open API is it never has that specialized data set you're interested in.
00:15:04
Speaker
Yeah, that's true. And that's where vector databases come in. That's a component that can be used in the stack. Just now you were saying open API, I think you're referring to OpenAI, right? And yeah, exactly. So I think OpenAI also has various ways for people to just
00:15:28
Speaker
post their data to them and they will then host it and give people a way to use it with their models. I've never tried that yet, but the exciting thing is that actually if you have your own data and it's just not part of any model that has been trained or fine-tuned and also you're not getting into fine-tuning because it's still fairly expensive or maybe not that easy.
00:15:51
Speaker
then you can just use a vector database together with an embedding model. That's today a very low threshold of entry, a very low entry threshold way to use your own data with LLMs.
00:16:11
Speaker
Okay, how does that work? What's the architecture there? Yes, so there are embedding models, whose purpose is to take some input, which is in natural language most commonly, and to translate it into vectors. And so, for example, I have a big book or I have a collection of documents. Then what I do is
00:16:35
Speaker
firstly, I will do some processing on them. They should be nicely formatted, yeah, not too many errors, because then the retrieval would also suffer later on. And then I split up that book into chunks, and those chunks are individually processed by an embedding model. And so each of those document chunks, which might be like 500 or 1000 characters long or something like that, is also assigned a vector.
00:17:04
Speaker
And those vectors, they
00:17:08
Speaker
often, if the embedding model is well suited to the domain and to the language that the book is published in, correspond somewhat to the semantics. So when I've done this kind of embedding, as it's called, connecting the snippets with the vectors, I can do natural language retrieval, which means I can just write a question in natural language,
00:17:37
Speaker
And then this question will also go through the same embedding model and then will also be transformed into a vector. And then what's fairly simple is to find
00:17:52
Speaker
what the distance between two vectors is. And so then, using that query, the document snippets from before that are closest to the query vector are retrieved from the vector database, and often it turns out they are semantically close to the query. So you can have a big document and quickly search it using natural language.
00:18:16
Speaker
It's somewhat fuzzy, and it's a lot of trial and error for people getting into it. But also, nowadays there are open source embedding models for download, there's LangChain to connect it all up, and then vector databases are also something that people can run on their local systems. There are open source vector databases.
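Here is a compressed sketch of that retrieval flow using LangChain-style components. The embedding model, the chunk size, and the Chroma vector store are illustrative choices, not recommendations from the episode.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

with open("book.txt") as f:
    text = f.read()

# 1. Split the document into chunks of a few hundred characters.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(text)

# 2. Run each chunk through an embedding model and store the resulting vectors.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectordb = Chroma.from_texts(chunks, embeddings)

# 3. Embed a natural-language question and retrieve the nearest chunks.
hits = vectordb.similarity_search("How do I configure a partition?", k=4)
for doc in hits:
    print(doc.page_content[:80])
```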
00:18:36
Speaker
So let me check I've understood this. So I get a vector database, like maybe the extension for Postgres or something like Weaviate or something like that. And I chop up my, let's say my product documentation, Apache Kafka documentation. There's loads of it. It's really hard to search. I chop that up. It gets turned into vectors of floating point numbers in the database. Then I say, how do you configure a partition in Kafka?
00:19:04
Speaker
It turns that into a floating point vector and then finds other vectors near it by the magic of indexes. Yes, indexes are important as well because, of course, if you have lots and lots of documents in the database and need to compute the distance of the query to all of the other vectors, that would take a long time. So there are clever ways to do indexing and to put, for example, the
00:19:27
Speaker
the closest vectors to one vector that's already in the database in the index so that the search can happen quickly. But yes, that's exactly what happens. And this is also exactly one of the use cases that could be treated by that. That's cool. OK, so I want to get into how that relates to things like large language models. But let's just check first. Is that language agnostic? I mean, humans, like, if I wanted to do that in German, would I have exactly the same experience?
00:19:56
Speaker
Probably not. So there's different embedding models and there's this platform called Hugging Face where you can find lots of different machine learning models in general. They have embedding models, they have large language models for download and for trying them out, doing research on them. And Hugging Face also hosts a leaderboard.
00:20:17
Speaker
a leaderboard for the embedding models, and then you can find which are the top performers according to various categories. And most of the embedding models, they are primarily focusing on the English language, and so
00:20:33
Speaker
the availability there is much better. But then, for the creators of those embedding models, I remember the one I'm using, I read the comments and there was positive feedback for one of the English-language-specific models, according to the training anyway, that they also perform well for East Asian languages, for example. And I tried German as well and it seems to work as well. I haven't looked into the details of that, but I
00:21:02
Speaker
I would imagine maybe the training data of that model also just contains some amount of content that is not in English. You have a book and maybe that book is for some reason multilingual and it's used to train that model or you have data from the web and then people post in forums. I don't know what they use in multiple languages. Coincidentally, that also works. For German, I've tried it out with
00:21:28
Speaker
a model that is meant to be used with English, and also it worked for me, the output made sense. It's trial and error in that case. Okay, that's curious. So is that the piece I'm missing, then: the embedding model is what converts this chunk of text into floating point numbers? Yes. And then the vector database does the indexing and searching? Yes, for similar vectors. Okay, so that's where I need to download this llama. Download this llama, I love that phrase. Yeah.
00:21:58
Speaker
So when you have the vector database set up and have decided which embedding model to use, and then maybe linked that all up, for example, with LangChain, which is a powerful open source tool to link various components that are related to machine learning models, especially large language models.

Llama vs Falcon Models

00:22:18
Speaker
Then there's another leaderboard, for example, of open source large language models where you could start out, and they have been benchmarked according to various metrics. Most of the models on that leaderboard are based today on the LLaMA models, because those are just the models that have had the most optimization, that perform the best on low spec hardware, or like normal consumer grade
00:22:46
Speaker
hardware that many people have at home. And so those LLaMA models, they can normally be recognized by their size. The biggest one is 65 billion parameters, and then some are called 30 billion parameters, but it's actually 33, so it's slightly inconsistent; you will find some naming using 33 billion and some using 30 billion. And then there's 13 billion and 7 billion parameters. That's what the LLaMA-based models are.
00:23:14
Speaker
Right now, it's also constantly changing. The leader of the open-source large language models is actually a 40 billion parameter model that's called Falcon, and that's completely unrelated to the LLaMA models. It was released by a research institute in Abu Dhabi.
00:23:35
Speaker
And it has a significant advantage over the LLaMA models when it comes to licensing; the license is much more permissive. Now, Meta has published their first gen models with a license that says you can do research on it, and better do your own research, I'm not giving legal advice. But the Falcon models, they
00:23:54
Speaker
are very capable and have a more permissive license. But right now, as we speak, the drawback is still that for the LLaMA models just much more optimization has been done. Yes, Falcon right now would not perform as well as a similarly capable LLaMA model. This is like breaking news when we're talking; this has all happened in the past few months. It has happened in the past few months, yes. And Falcon is, I think, not more than two months old, and
00:24:21
Speaker
So you've done this locally with your own datasets, right? And I want you to tell me what you've been using it for personally, but also like what performance, which models do you choose? What kind of performance do you get out of it? So at first I just wanted to try out what can I actually do with those models here, because I'm really, really excited about GPT-4 and what it can do with the models behind that. They are very powerful, very capable.
00:24:50
Speaker
But also, there's a drawback. Sometimes, I'm not sure. If I'm writing a query, will it be considered inappropriate? Will I get flagged? Just the horror of maybe losing access to that kind of resource, because some value system that has been used to filter the prompt doesn't exactly match my own value system.
00:25:14
Speaker
So it's basically a matter of freedom and just being able to just put my unfiltered thoughts into a large language model. And of course, if it's running locally, then I have nothing to worry about regarding that.
00:25:31
Speaker
And so, yeah, but mostly I've really been doing benchmarking and comparing and linking it all up recently. I posted something to my GitHub as well, which is, I cleaned up my own LangChain code, which has this
00:25:48
Speaker
toolchain that we've been talking about, with a vector database, with an embedding model, with a large language model. And it's somewhat easy to use, at least for me. I don't know if anybody else has had a look at it yet. When there's a new model that's being released and I just want to drop it in quickly, if that model is LLaMA-based, then I can try it out rather quickly. And
00:26:11
Speaker
Yeah, so I've been using it for various documents and we've been posting about it. I'm trying to only use public domain data when I make a LinkedIn post or something like that for obvious reasons.
00:26:26
Speaker
But it works quite well with books. Recently, for example, I've used the Bible, which is one of the public domain books that just comes to mind. It's fairly big. And then I just wanted to know, how long does it take to create embeddings from it? And it turns out on my system, it's like less than a minute and I have all of the Bible subdivided into snippets and embeddings created from those snippets.
00:26:53
Speaker
Yeah, we'll start, we'll sidestep the whole religion discussion. Yeah, of course, of course. That's a nice, large, open source collection of books, right? Yes. Makes sense. Yes.
00:27:03
Speaker
There's lots of open source data out there. You can use yearly reports of companies. You can use whatever you might have put into the data lake of your own company. It's suddenly becoming much more exciting in terms of what's possible with all the data that has been just assembled in some cases over many years.
00:27:27
Speaker
Give me the numbers. So how long on your laptop, which presumably is a reasonable spec laptop, does it take you to index, if that's the verb, the Bible, and then what kind of query performance do you get from it? Yeah, well, in part motivated by that Google memo, I've recently bought a fairly beefy machine. So it's not a laptop anymore. It's a desktop and it has a 4090, which is
00:27:55
Speaker
currently the top of the line NVIDIA graphics card. And when I create embeddings from the Bible on that machine, it takes less than a minute. But also, I wouldn't expect it to take much, much longer on lower spec machines. I would need to do testing on that. But currently, I'm saying, what's the point? I want to use the best I have available. And I might move into clusters if I hit some limits.
00:28:21
Speaker
If I'm doing this at home, I can expect minutes, not overnight. So what I've seen is, I've had a fairly large corpus of text, 1.6 gigabytes, and I split that up into documents that are 1000 characters long. And that took a few hours on that same machine. But I've also noticed that that is much quicker if I use documents that are only 500 characters long. So it seems to be nonlinear to me.
00:28:49
Speaker
Take it with a grain of salt, I haven't done a study on this, it's just my impression from what I've been doing. Currently I much prefer a document length of five hundred characters, just for the performance reason. So this is another decision you've got to make: you need to choose your chunk size.
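If you want to reproduce that chunk-size comparison on your own corpus, a rough sketch would be to embed the same text with two chunk sizes and time each run. The corpus file and embedding model are placeholders; Tobi's numbers come from his own machine and data.

```python
import time
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
text = open("corpus.txt").read()

for chunk_size in (500, 1000):
    chunks = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=50).split_text(text)
    start = time.time()
    embeddings.embed_documents(chunks)  # create one vector per chunk
    print(f"chunk_size={chunk_size}: {len(chunks)} chunks embedded in {time.time() - start:.1f}s")
```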
00:29:07
Speaker
Do you end up iterating a lot and just tweaking parameters, seeing what's going to happen? Yeah, the reason I've decided I want to publish something on GitHub is because I noticed I just started out scripting something with Python and with LangChain and I cannot actually Python, so it became something like...
00:29:25
Speaker
unwieldy when I wanted to extend it further. And so I thought, OK, now I need to do some design pass on it. And so I thought, OK, let's take open source as a motivator. If I want to publish something that another person might theoretically be able to use, then what would be an interface? And then from that perspective, I arrived at a design that is now
00:29:48
Speaker
again, much, much more extensible and configurable and more wieldy again than what I previously had. And I think it's precisely what you say. I want to fiddle with the parameters. I want to drop in different models of different quantizations, try different document sizes, compare. And there's so many variables. And so it's really important
00:30:14
Speaker
if I want to have meaningful knowledge that I'm methodical about it and systematic and so it's important to design it well.

Practical AI Use Cases

00:30:26
Speaker
Yeah, this is cool: you're a home AI researcher now, officially.
00:30:37
Speaker
The really big job is getting this language model, which you get from someone else. The lesser, but still quite chunky job is indexing the corpus, in this case the Bible. And then you've got this final vector database output, right? And that's a thing you can deploy to someone else and they can start querying.
00:30:58
Speaker
Yes. I mean, I can, of course, host a vector database anywhere, or I can also use just some cloud API. If it's not data that I really need to keep on my own systems, then there are cloud providers. There are also cloud providers which care about privacy. And so there's many options there. And so this whole process, you don't even need a local embedding model. If you don't care about the data really remaining private, you can
00:31:27
Speaker
use an OpenAI embedding model; they have a good offering that's becoming less and less expensive in terms of per-token cost. So there are many, many options.
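For that hosted option, the same embedding step can point at OpenAI's API instead of a local model; here is a tiny sketch with LangChain's wrapper. It assumes an OPENAI_API_KEY in the environment, and the example chunks are just placeholders.

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

chunks = ["first document chunk...", "second document chunk..."]  # placeholder text

embeddings = OpenAIEmbeddings()  # reads OPENAI_API_KEY from the environment
vectordb = Chroma.from_texts(chunks, embeddings)
print(vectordb.similarity_search("example query", k=1))
```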
00:31:39
Speaker
OK, I think where I'm getting to in this is I can imagine a future where I've got an AI trained on the data I'm interested in. Maybe I've trained on all the emails I've ever written and all the blog posts I've ever written. And then I would like to deploy that to my phone where it could write emails for me on the go. Yes. Are we approaching that future?
00:31:58
Speaker
There are people who are already doing that, kind of pre-writing their emails based on what they can do with AI. And what's popular for people who are doing that who are not that familiar with coding is, for example, that they have access to low code or no code tools that also provide a way of,
00:32:16
Speaker
for example, using LangChain, connecting the Gmail API with the OpenAI API. So right now it's maybe just a combinatorial explosion of what's possible, which is also what you describe. Depending on how much you're willing to do yourself, it's not the future; it's already possible and it's already being done.
00:32:38
Speaker
So we're in the actual productionizing phase of this stuff. Yeah, of many, many things. It will take many years until the potential of what's currently out there has been exploited. Okay, where should I be looking? I mean, have you got any recommendations? Well, I think right now there
00:33:01
Speaker
are just so many possibilities and the development is progressing so rapidly that I would say what's a good starting point is precisely this architecture where you're using a vector database to be able to use some data that's just not
00:33:18
Speaker
been used in any training or fine-tuning process for models, and then use an embedding model, use a large language model, and then look, for example, as I've been doing recently, into more of the LangChain features. LangChain has really a lot of ways to deal with the limited context size of the large language model, to do compressions using large language models again, to extract data from various formats of documents,
00:33:45
Speaker
even there. And then combine it with a business use case, maybe something that you care about personally, something that... maybe if you have a business, or if you know business owners, talk to them, what are their challenges, and then just connect,
00:34:04
Speaker
maybe connect the dots. And I would say, due to the sheer number of possibilities of combining, and also due to the speed of development, for me the question is not where should I be looking, but what filter should I apply so that I end up actually doing something and not only being overwhelmed by all the things that are there to be used.
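One way the pieces Tobi recommends combine into a question-answering chain is LangChain's RetrievalQA. The sketch below assumes a small stand-in corpus and a local quantized model file; the paths, model names, and parameters are placeholders, not values from the episode.

```python
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import LlamaCpp
from langchain.chains import RetrievalQA

# A small stand-in corpus; in practice this is the chunked document set from the earlier steps.
chunks = ["The report discusses AI initiatives...", "Revenue grew in the robotics division..."]
vectordb = Chroma.from_texts(chunks, HuggingFaceEmbeddings())

# Local LLM running via llama-cpp-python; the model path and parameters are placeholders.
llm = LlamaCpp(model_path="./models/llama-30b.q4_0.bin", n_gpu_layers=40, n_ctx=2048)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectordb.as_retriever(search_kwargs={"k": 4}),
)
print(qa.run("What does the report say about AI?"))
```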
00:34:29
Speaker
Yeah, the classic Kid in the Candy store problem, if you've got too many options. Exactly, yes. Okay. Is there something in particular you're working on other than slicing up the Bible? Yes, of course. So currently, I'm just starting out to look for a job and obviously there are applications there. So in the beginning, I just take a somewhat global perspective and I think about what are the fields that I would most like to be working in.
00:34:57
Speaker
And so my current challenge is to just find out who are the major players in each of the fields in the domains. And so obviously, when you've got a big company, you also have to publish lots and lots of stuff. And then I can just create embeddings of the daily reports, maybe, maybe even of job postings. Yeah. And then
00:35:21
Speaker
ask natural language questions about it. I make progress on my skill set when it comes to LLMs at the same time as I narrow down where I would like to go next. This is right now what I'm thinking about regarding LLMs. That's brilliant. Are you employing AI for your job, sir? Yes.
00:35:40
Speaker
That's great fun. But how's that working? Are you like, you pick five companies you like and you index them all or you just say like, here's a particular company I'd like to learn about. So I'll query that. There are definitely some companies that I'm eager to be working for potentially no matter what they are currently doing and publishing. And in that case, it's easy. I just try to find out what they do. In some cases, I'm just
00:36:08
Speaker
I just joined a meetup where there's somebody from that company. That's a low-tech way of approaching it. There's still a place for natural intelligence in talking to human beings. Of course. It's a much more efficient way than taking the public route, taking the information that everybody can access. I think a big company that's hiring always also has a spam problem.
00:36:31
Speaker
The public route to hire is often the one that's much more resource intensive as well. One thing that I have in mind is I'm going to take the Fortune 500 and then I'm going to see how do they publish their yearly reports if they publish them in English and then also going to see
00:36:56
Speaker
AI, what are they writing about AI, which subsidiary may be doing work related to that field, and other important technologies. I'm really passionate about the potential of robotics as well, of fusion energy, and I can just browse their publications and their reports
00:37:18
Speaker
according to what they are connected to, in what way, how they are focusing it, where I should turn. Maybe I will just look at the website of some subsidiary instead of the main corporation. Yeah, in that case, I narrow down what's the most efficient way to approach a potential employer who I know nothing about yet; I'm only passionate about the domain they are engaged in.
00:37:43
Speaker
Yeah, this is the coolest and definitely the nerdiest approach to job searching I've ever heard. So full marks there. We should talk a bit about the other side of AI, which is prompt engineering. When you've got these models, what do you actually say to the thing? So the thing is, if you have a model that doesn't have that many parameters, a small model, then prompting and prompt syntax is really, really important. So,
00:38:10
Speaker
for example, there are some small models where if you leave out a space at the wrong point, or if you don't use exactly the prompt syntax that was also used in fine-tuning them, then they will not produce useful output, or the usefulness of the output is really decreased. That's a big disadvantage that a very small model has, for example, versus what we know from the big models, and GPT-4 especially, where you are really free to just prompt in any way.
00:38:43
Speaker
Recently, I've mostly been using the 30 billion size models, or 33, and they are really much better at that already. They provide people more leeway. They will be more accepting of free form input. Still, it's important when I build a pipeline to have some structure that makes sense for a prompter. If I know, okay,
00:39:07
Speaker
I will put lots of document snippets in the prompt together with my query, and then maybe I also have a conversation memory, which is another very important component if you want to do conversation and not just have, like, zero-shot prompting, I think it's called,
00:39:22
Speaker
and just expect the desired output within just one interaction. Then the prompts sometimes become somewhat complex. And then it's important to have some formatting which will let the large language model, so to say, make sense of the prompt. And then they have, like,
00:39:41
Speaker
10 snippets from some report, and then they have what has happened previously in the ongoing conversation, and then there's the actual query.
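As a sketch of how such a prompt might be assembled for a Vicuna-style model, the snippet below strings the retrieved context, the conversation so far, and the new question together. The exact syntax varies per model and should be checked against its model card; everything here is illustrative.

```python
# Placeholder pieces that would normally come from the vector database and the conversation memory.
snippets = ["<document snippet 1>", "<document snippet 2>"]
history = "USER: Summarise chapter one.\nASSISTANT: Chapter one introduces the main ideas."
question = "And what happens in chapter two?"

prompt = (
    "A chat between a curious user and an artificial intelligence assistant.\n\n"
    "Context:\n" + "\n".join(snippets) + "\n\n"
    + history + "\n"
    + f"USER: {question}\nASSISTANT:"
)
print(prompt)
```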
00:39:54
Speaker
Maybe the next few days I'm going to experiment also with smaller models and see what's the complexity that I can still expose them to while expecting output that makes sense, output that's useful. And then, of course, they perform much better. So this is if I want to run stuff in parallel or if I just want to do async stuff, then using smaller models is very interesting and attractive. But I need to know what can they do? And what can those generalized models do? Where do I use specialized models?
00:40:24
Speaker
Options, options. Are you saying that with a smaller model, you have to be more careful about talking to it, but it still gives you very high quality results back? Potentially, yes. It depends very much on the fine tuning that the model has and then how it relates to my use case.
00:40:50
Speaker
where I really need to do some experimenting. Recently, as I say, I've been using 30 billion parameter models for just about anything, but it's a waste of resources for some use cases. If I just want to sum something up, I give it a prompt and I just want to use the language capability of the model to give me a summary of what I've presented it to be used in the next step of a chain of language models. Then
00:41:15
Speaker
I might imagine a smaller language model is already capable of that, but I need to test it and also see what works and what doesn't work for the use cases that I have in mind.
00:41:25
Speaker
Right. So do these language models, do they ship with, like, here's how to talk to it? Yes. Or is that trial and error? Yeah, well, for the models on Hugging Face, there's something called a model card. And every model should have a model card which lists its prompt syntax, in the best case. There are sometimes...
00:41:46
Speaker
Yeah, sometimes whoever uploads a model, they're not too tidy about it, and then people have to figure out: okay, I have this model, there's nothing in the model card, but maybe there's a link to the model that it was based on or that it is a quantization of. And then sometimes it's also a matter of, okay, this
00:42:05
Speaker
just says this model uses the Vicuna 1.1 prompt. If you know where to find that, that's a good thing. In that case, there's the, I think,
00:42:19
Speaker
it's on GitHub, I think it's called oobabooga or something like that, and they have just a folder of common prompting syntaxes. If you know where that is, you can look into that, and it's fairly likely, if a model card says "I want that prompting syntax", that you can find that prompting syntax in that repo.

Prompt Syntax Importance and Tools for Beginners

00:42:40
Speaker
This is from, I've just started reading an old sci-fi book called The Moon is a Harsh Mistress. Oh, I don't know that. And the lead character has been trained in this special language on how to talk to AIs. They've got a name for the language, but it's about Loglan or something, but it's really reminding me of that. Indeed, yes.
00:43:00
Speaker
That's superb. I think you've given me enough to get going, actually. Where's the first place I should go and download after this conversation?
00:43:15
Speaker
I would personally recommend to get started with LangChain. LangChain is basically this big toolbox which enables you to link lots of components together, whether it be those APIs from OpenAI, from Anthropic, from Google probably, also other providers.
00:43:34
Speaker
And then you can just start out. Maybe you get free credits from some of the API providers and can experiment there. They have lots of tutorials. They have descriptions of basic use cases, getting started page. So just start out with the simple use cases. And then once you figured out some of the components, then you already know you can run your vector database locally.
00:44:04
Speaker
download an embedding model to actually use the vector database with your own documents. Then, for example, if you're using Python, you will need a component like PyPDF, but it's easy to install, it's just a Python package. Then on Hugging Face you have those leaderboards of the open source models, for the embedding models and for the open source large language models as well.
00:44:29
Speaker
Yeah. Then there's llama-cpp-python, which I think is a very good package right now, which is the one that enables the mixed mode between CPU and GPU for the LLaMA-based models. And also there's some preliminary work on the Falcon model, that's the one from Abu Dhabi, so that's also receiving its first optimizations.
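Pulling in your own PDFs is a concrete first step along those lines; here is a small sketch using the PyPDF-backed loader mentioned above. The file name is a placeholder.

```python
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

pages = PyPDFLoader("yearly-report.pdf").load()  # placeholder file name
docs = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(pages)
print(f"{len(pages)} pages split into {len(docs)} chunks, ready for embedding")
```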
00:44:49
Speaker
And you are actively blogging about this, so we can also check your blog for tips. Yeah, well, I'm writing on LinkedIn and I'm blogging about it. And yeah, so I don't really know what I will discover next.
00:45:04
Speaker
I'm eager to publish it, spread the message about what's possible today. Because basically, if you're a student at university, you can just try it out, get into the field, and there's lots of potential. There's basically lots of really early work yet to be done, yet to be discovered, to be published about, to be talked about. It
00:45:26
Speaker
connects to just about any domain that is in some way cognitive. So it's a thing that will go on for at least years, if not longer. And who knows where it will go.

Future of AI Speculation

00:45:40
Speaker
Yeah, I can totally see that shaping the next few years of the discussion. It's like a rich, ripe field. We're only really months into open source bedroom hackers exploring, right? Yes, indeed.
00:45:55
Speaker
Great fun. Well, in that case, we'll have to have you back on the podcast in a year or so, and you can tell us how far you've got. I look forward to it. Great. Tobi, in the meantime, thank you very much for taking us through it. And thank you.
00:46:07
Speaker
Tobi, thank you very much. That was truly enlightening. And I think I now know enough to achieve one of my real life goals, which is to download transcripts of all David Bowie's interviews over the years and get something that could rewrite my LinkedIn posts in a tone that I really respect.
00:46:27
Speaker
I suppose the flaw in that plan is it assumes there was one David Bowie, when really we got a different one every few years. He was a great chameleon, as they say. Over here we're much more stable than that, and we'll be back next week with another developer lending their voice to the global conversation.
00:46:45
Speaker
So make sure you catch it by clicking like and subscribe and follow and notify and rate and all those good things. And you know, drop me a comment. As great as AI is, there's still no substitute for hearing from real people.
00:47:00
Speaker
That's the very raison d'être of this podcast, isn't it? So, check out the show notes if you want to get in touch with me. Check out the show notes if you want to get in touch with Tobi, because at the time of recording this, he is available to hire. On your marks, get set, go get him. And he also gave me a list of links that you'll want to look at if you want to learn more about this field. They're in the show notes too.
00:47:24
Speaker
All of which, I think, brings us to the end of this episode. I've been your host, Kris Jenkins. This has been Developer Voices with Tobi Fankhänel. Thanks for listening.