
How to Build AI That Actually Delivers with Dr. Arjun Jain (Fast Code AI)

Founder Thesis

In this episode, Dr. Arjun Jain - Founder of Fast Code AI - reveals why the AI industry's trillion-dollar bet on bigger models is failing, and what's replacing it.  Dr. Arjun Jain isn't your typical AI founder. After training under Turing Award winner Yann LeCun at NYU, working on Apple's secretive autonomous vehicle project, and leading Mercedes-Benz's robotaxi AI, he returned to India to bootstrap Fast Code AI with zero venture capital. In just two years, his company grew 8x by doing what the AI giants won't: charging for outcomes instead of software seats, deploying Small Language Models that outperform GPT-4 for specific tasks, and building agents that actually work in production.   

He shared this contrarian journey in this candid conversation with host Akshay Datt. From explaining why "we have but one internet and we've used it all" (quoting OpenAI's Ilya Sutskever) to revealing how procurement agents train by negotiating with themselves millions of times, this episode dismantles the AI hype and shows what enterprise automation actually looks like. Whether you're a founder evaluating AI vendors, an engineer choosing between foundation model labs and application companies, or an investor trying to separate signal from noise, this is the reality check the industry needs.  

What You'll Learn:  

👉Why scaling laws have stagnated and what test-time compute and reinforcement learning mean for enterprise AI's future 

👉How Fast Code AI captures "Salary TAM" (30-70% of revenue) through outcome-based pricing instead of traditional SaaS seat licenses 

👉The real reason AI engineers command $10-100 million salaries, and why this won't last as foundation models commoditize 

👉Why Project Athena (Mercedes-Bosch's multi-billion euro robotaxi venture) failed, and why end-to-end learning beats modular approaches 

👉How Small Language Models fine-tuned on company data outperform massive generic models at 1/10th the inference cost 

👉The "self-play" reinforcement learning methodology that makes Fast Code's agents reliable in production, not just impressive in demos

#DrArjunJain #FastCodeAI #AgenticAI #AIScalingLaws #EnterpriseAI #SmallLanguageModels #OutcomeBasedPricing #ReinforcementLearning #AIAgents #YannLeCun #BootstrappedStartup #IndiaAIStartups #AIServices #TestTimeCompute #FoundationModels #LLMLimitations #AIForEnterprise #ProcurementAutomation #AIEngineers #AutonomousDriving #ProjectAthena #SalaryTAM #AIInference #AIDeployment #BangaloreAI #AIConsulting #MachineLearningExplained #DeepLearning #TransformersAI #FounderThesisPodcast 

Disclaimer: The views expressed are those of the speaker, not necessarily the channel

Transcript

The Allure of High Salaries in AI

00:00:00
Speaker
We have but one internet and we've used it all. Why are AI engineers getting $100 million salaries? If you're spending, say, on the scale of hundreds of millions of dollars or billions of dollars in compute, why not pay a person $10 million or even $100

Introducing Dr. Arjun Jain: AI Pioneer

00:00:18
Speaker
million? Dr. Arjun Jain is an AI pioneer who worked on Hollywood blockbusters and Apple's self-driving car project.
00:00:26
Speaker
He is the founder of Fast Code AI, deploying autonomous AI agents to save enterprises millions of dollars. A housecat is more intelligent than the LLMs. You show a child a weird animal and you say, this is a rhinoceros, and probably it'll remember that this thing which has a horn on its nose is a rhinoceros. But our modern machine learning models cannot do that yet. From BERT to ChatGPT, how

The Future of AI Projects: Apple's Decisions

00:00:50
Speaker
did that happen? Does India have a chance to be one of the leaders in self-driving? Why did Apple shut down that self-driving car project?
00:01:07
Speaker
Arjun, welcome to the Founder Thesis podcast. I've been looking forward to this episode because there is so much I want to learn about AI, and you really are the right person to learn from. Give me a quick reverse pitch on why you are the right person to learn AI from.
00:01:25
Speaker
I know I'm putting you on the spot, but... So thanks, Akshay, for having me. I think I'm the right person because AI is my bread and butter, and that's also my training and pedigree.

Evolution of AI: Milestones and Technologies

00:01:38
Speaker
So I did my PhD in computer graphics, machine learning for computer graphics. I did things for movies, The Adventures of Tintin (2011) by Steven Spielberg and Peter Jackson. I did motion capture for Snowy the dog. I was then a postdoc with Yann LeCun, one of the pioneers of deep learning, who won the Turing Award in 2018.
00:01:59
Speaker
And I've been doing machine learning since it became a thing. Machine learning kind of took off in 2012, and then again when ChatGPT happened. So I think I've been doing this for the last one and a half decades. This is why I'm the right person.
00:02:16
Speaker
So, like you said, the whole field of machine learning is maybe a decade or so old. And there's that saying that change happens slowly and then suddenly. We've gone through that "suddenly" moment for AI, where suddenly we have GPT-5. It just seems like yesterday that...
00:02:38
Speaker
people were looking at the first GPT-3 as a novelty, and today it's such a mainstream thing. Can you tell me how this happened in just a decade? How did we get here from, say, IBM's Deep Blue? I remember it was supposed to be AI, but I don't feel it was anywhere close to what we have today. Can you take me through that journey?
00:03:10
Speaker
Yeah, so, I mean, the IBM system was a tree-search system. You essentially have these combinatorial possibilities when you play chess, but you can still represent all possible moves as one giant tree. And then you do pruning and, you know, branch and bound and so on.
00:03:29
Speaker
Can you, like, layman that? What does that mean? You're saying that IBM's Deep Blue, which beat Garry Kasparov, I think, more than a decade back... you're saying it was an
00:03:44
Speaker
advanced database with a search facility? Yeah, so it was essentially a heuristic search. I mean, of course, the combinatorics are huge. It's like 2 to the power 64
00:04:01
Speaker
if you have just two possibilities per square of a chessboard. But in theory, you can still create a giant tree of possibilities: okay, if I make this move, and then you make that move, and then I make that move, that is one path in that tree. And then you can have all these different combinations laid out. I mean, it is very expensive, but still not as expensive as the systems of today.
00:04:29
Speaker
And what you then do is, based on the current state of where you are and what the other person has done, you decide your next move, because you can traverse all outcomes from there and know, okay, what will the final outcome be if I make this move at the current given time? Okay. So yeah, I mean, this is

Challenges and Opportunities in AI Programming

00:04:50
Speaker
more like these search kinds of algorithms.
00:04:52
Speaker
Like humans put their knowledge into a system, and the system was able to map that knowledge and use it to play chess. That was what it was. Yes. And there was not really learning. It was not really learning from data. It was calculating different outcomes and choosing based on that.
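To make that concrete, here is a minimal sketch of the kind of heuristic game-tree search being described, written in Python with alpha-beta pruning. The game-specific helpers (legal_moves, apply, evaluate) are hypothetical placeholders, not Deep Blue's actual code.

```python
# Toy minimax search with alpha-beta pruning: the "giant tree of possibilities"
# is explored move by move, and a heuristic evaluate() scores the leaves.

def minimax(state, depth, alpha, beta, maximizing, legal_moves, apply, evaluate):
    """Return the best achievable score from `state`, looking `depth` plies ahead."""
    moves = legal_moves(state)
    if depth == 0 or not moves:
        return evaluate(state)  # heuristic value of this position
    if maximizing:
        best = float("-inf")
        for move in moves:
            best = max(best, minimax(apply(state, move), depth - 1,
                                     alpha, beta, False, legal_moves, apply, evaluate))
            alpha = max(alpha, best)
            if beta <= alpha:   # prune: the opponent would never allow this branch
                break
        return best
    else:
        best = float("inf")
        for move in moves:
            best = min(best, minimax(apply(state, move), depth - 1,
                                     alpha, beta, True, legal_moves, apply, evaluate))
            beta = min(beta, best)
            if beta <= alpha:
                break
        return best
```

Nothing here is learned from data: all the knowledge lives in the hand-written evaluate() heuristic, which is exactly the contrast drawn above.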
00:05:11
Speaker
Okay, got it. So, yeah, tell me from there. How did we get... Yeah, I mean, people have been trying to do this, let's say, from the 60s. There was this machine called the Perceptron, and it had these physical knobs, potentiometers, and each potentiometer encoded a weight. And there were these photodiodes where you could shine a light, say a handwritten digit, and then, based on the weights in the potentiometers... so it was essentially a hardware machine, but it was still trying to classify, let's say, a handwritten digit into the correct class. So people have been trying it since then. And the motivation comes from the fact that sometimes it is very difficult to...
00:05:57
Speaker
hand-code a problem. A very simple example: let's say you and I know how to write computer programs, and I want you to write a program to, say, identify handwritten digits.
00:06:16
Speaker
And let's say we start with the simplest digit, which happens to be one. So how would you do it? For example, you might try to fit a line to it, or count: okay, if there is a pixel below a pixel, and if you have that till the end, it is a one. And it would sort of work.
00:06:36
Speaker
But then different people write ones in different ways, right? Somebody writes it one way, somebody just does a plain stroke, and somebody has a different slant. But still, you can write a complicated if-then-else program: oh, if there is a line like this and like this and like this, then it is a one. And also if it is just like this, then it's also a one.
00:06:55
Speaker
But okay, fine. You've made it really complex, maybe 500 or a thousand lines of code. But then imagine what you would do if you now had to do it for twos. You see where I'm going with that? It becomes more complicated. You're now talking about fitting curves and stuff.
00:07:11
Speaker
But on the other hand, let's think of a very orthogonal approach, where we ask, let's say, a bunch of people to write different handwritten twos.
00:07:24
Speaker
So let's say we have a 100 by 100 image, and we ask 10 different people to write a two. They've all written these twos, and we take these and simply average the images out.
00:07:36
Speaker
So what do you think the average image would look like? I guess, you know, this kind of a circle and then a line below. Exactly. Maybe a little bit blurry because some people wrote a little bit up, somebody a little down. So in some sense, that is already a data-driven algorithm, right? This is your first moment. It's a mean.
00:07:55
Speaker
You're not learning parameters yet. I mean, you're not training any parameters. Trainable parameters are parameters for which you find values: you start with random numbers and then you find values for them. But in this case, it is just the first moment of your distribution. In statistics, it's called the first moment, which is essentially the mean; the second moment is the variance.
00:08:15
Speaker
So now, if I have this template, this mean image, and you give me an image, I just need to compare with this. How we compare is again a different question. You can use a Euclidean distance, you can use a cosine distance, but we just need to compare between what I had as my template, my data-driven algorithm, and what you gave me.
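A minimal NumPy sketch of that template idea: average a few example images into one template, then compare a new image against it with Euclidean and cosine distance. The random arrays are placeholder data standing in for real handwritten digits.

```python
import numpy as np

# Toy version of the "average the twos" template classifier described above.
# Assume `examples` holds ten 100x100 grayscale images of handwritten twos.
examples = np.random.rand(10, 100, 100)   # placeholder data
template = examples.mean(axis=0)          # the first moment: a blurry mean "2"

def euclidean_distance(image, template):
    return np.linalg.norm(image.ravel() - template.ravel())

def cosine_distance(image, template):
    a, b = image.ravel(), template.ravel()
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Classify a new image by whichever digit template it lies closest to.
new_image = np.random.rand(100, 100)      # placeholder query image
print(euclidean_distance(new_image, template))
print(cosine_distance(new_image, template))
```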
00:08:35
Speaker
And this is one way to solve this problem in a data-driven approach. So this is where the data-driven side of things starts to come in. Here we were talking about a very simple problem. Now suppose we were to go to cats and dogs and elephants and horses; then the problem becomes way more difficult. So how does this lead to ChatGPT?
00:08:59
Speaker
Yeah, yeah. So we started from there, and then we realized the value of data-driven algorithms, because sometimes writing heuristics is not enough. It's such a complicated problem that by the time we are able to articulate it in code, it's already very hard and we are not able to cover all cases and so on. So we said, okay, let's try the data-driven approach. We did a bunch of different kinds of algorithms: the so-called support vector machines, graph algorithms. And they were doing okay. The logistic regressions, the linear regressions and so on. But these were small models, with a small number of trainable parameters, and you could train them on a small amount of data.
00:09:49
Speaker
And because the parameters and data were small, you could train them on small hardware, small computers. But then this deep learning revolution happened. There were a few of these believers who kept at it, even though earlier people were not able to make these models work.
00:10:08
Speaker
We can discuss why and how. But in 2012, a bunch of things came together and this explosion happened in deep learning with this paper called AlexNet from Geoffrey Hinton's group: Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton. This is when Alex trained a neural network on this so-called ImageNet data on GPUs, and it beat the
00:10:40
Speaker
older algorithms, which were more of these handcrafted algorithms, by a big margin. And this is when we knew that this is going to take over completely, and this is where the neural network revolution started.
00:10:53
Speaker
And then a bunch of things happened. So this was 2012. By 2015, 2016,

Understanding AI Learning Mechanisms

00:11:00
Speaker
things started to move into industry. It started with convolutional neural networks, so images were working well. Then it lent itself very naturally to speech, because speech is just a 1D signal where images are 2D; so the WaveNets of the world. But text was still a little bit handcrafted. Then there were advances in text starting from the BERT models, and these BERT models blew up when the Attention Is All You Need paper came out. There were these encoder-decoder architectures, and then
00:11:44
Speaker
ChatGPT happened, when these decoder-only architectures, because of the scale of data and compute they were trained on, started to behave in very, very interesting ways. And this is how this ChatGPT revolution happened. There were some believers, and these believers said: let's take all the data we can get our hands on, so all of the internet data, and all the compute, and train really large models, like 500 billion parameters, 1 trillion parameters. And to give you a feel for the scale of it, AlexNet was about 60 million parameters.
00:12:25
Speaker
Okay, I'm going to tell you what I understood and you tell me how much of it is right. Yeah, please. So the debate for the longest time was heuristic model versus data model. A heuristic model is, like you said, if-then, but very, very complex if-then, multiple ifs, multiple thens: if there's a semicircle and a line below it, that is a two.
00:12:50
Speaker
And the data-driven model is: you have thousands of twos in different formats which are given to a machine, and the machine is able, through various algorithms, to come to an understanding of what a two is. Therefore, when it is presented with something which is a two, no matter what way it is written, there is a high chance that it will recognize that it's a two.
00:13:12
Speaker
And this data-driven approach was not successful for the longest time, until AlexNet. What changed? Why did it become successful with AlexNet? So I wouldn't say it was not successful, but it wasn't successful at scale.
00:13:31
Speaker
We were not able to train very big models, and for smaller models, it wasn't performing as great. So we were still using machine learning, but the input to the machine learning algorithm was these so-called handcrafted features. We used things like the histogram of oriented gradients. So you have an image, and the input to a machine learning algorithm is always a vector, so we convert this image into a vector representation using different methods. One of them happens to be, let's say,
00:14:04
Speaker
what I mentioned, the histogram of oriented gradients. You convert your image, or patches of that image, into histograms, and then you can do some kind of machine learning on top of it, let's say a support vector machine. So these were common. I mean, we were still using data-driven algorithms, but AlexNet is when this blew up and we stopped using these handcrafted descriptors.
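As a rough sketch of that pre-deep-learning pipeline, handcrafted HOG features feeding an SVM, assuming scikit-image and scikit-learn are installed; the built-in digits dataset stands in for whatever images you care about.

```python
import numpy as np
from skimage.feature import hog
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

# Classic two-stage pipeline: handcrafted descriptor (HOG) -> classifier (SVM).
digits = datasets.load_digits()  # small 8x8 grayscale digit images

# Convert each image into a fixed-length feature vector of gradient histograms.
features = np.array([
    hog(img, orientations=8, pixels_per_cell=(4, 4), cells_per_block=(1, 1))
    for img in digits.images
])

X_train, X_test, y_train, y_test = train_test_split(
    features, digits.target, test_size=0.2, random_state=0)

clf = svm.SVC(kernel="rbf").fit(X_train, y_train)  # learning happens only here
print("accuracy:", clf.score(X_test, y_test))
```

The key limitation being described: the HOG step is fixed by hand, so only the SVM stage actually learns; AlexNet-style deep networks learn the feature extraction too.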
00:14:26
Speaker
So we do not handcraft it; we let the model learn what it wants to extract out of it. How does a machine learn? Yeah, very good question. So what does it even mean to learn? All it means is you have data and you have targets that you want for this data.
00:14:52
Speaker
So if you have some handwritten digits, for them you have some labels. For a handwritten digit 2, you want the class to be two, or for a horse image, you want the class to be horse.
00:15:07
Speaker
So you have your data, you have your labels, and then you have some kind of quantification of the output. You have your model; let's assume it's a black box, and the model gives us some output.
00:15:22
Speaker
And the output is initially garbage, obviously, because you have not trained this black box. But then you have your target, where you want to get to. And you have some kind of measure of your happiness: how happy are you with this output?
00:15:37
Speaker
And what you want is to be happy. You want your model to make you happy, so you train your model in such a way that you are eventually happy. And what does it mean to train?
00:15:50
Speaker
It means that this model comes with these so-called parameters. They're essentially numbers, right? Weights. They're all the same thing, weights or parameters, which are essentially like,
00:16:04
Speaker
let's say, memory. It's a lot of random numbers that we start with. But then, because we want to be happy, we try to find which settings of our weights, our parameters, make us happy.
00:16:17
Speaker
So from our quantification of happiness, we get to some good set of weights, some good set of these parameters. And that is essentially learning.
00:16:28
Speaker
I don't know if I made any sense. There's some sort of a reward function here to tell the model that I'm happy, and therefore the model is striving to increase happiness. And so it will give, say, hundreds of thousands of responses, and for the responses which are making you happy, it will prioritize the path it took to generate those responses.
00:16:52
Speaker
Yeah, so usually we use the word reward function in the context of reinforcement learning, and here we use the so-called loss function.
00:17:02
Speaker
I think reinforcement learning people were maybe more optimistic than, say, supervised learning people. So it's the same thing, right? You want to maximize the reward or minimize the loss. But reinforcement learning...
00:17:16
Speaker
is... I mean, there is this notion of the loss being differentiable. If the loss is differentiable, you don't need to try those hundreds and thousands of different paths through the model.
00:17:29
Speaker
You don't really have to sample and search. You actually know the direction in which your parameters need to go because of the slope, from, you know, calculus. We calculate these so-called gradients, these slopes, and we go in that direction iteratively, many, many times.
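A minimal sketch of what that learning loop looks like mechanically: start from random weights, measure a differentiable loss, and step the weights along the negative gradient, repeatedly. Pure NumPy, fitting a linear model to made-up data.

```python
import numpy as np

# "Learning" in miniature: random weights + differentiable loss + gradient steps.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # made-up inputs
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)    # made-up targets

w = rng.normal(size=3)                         # start with random numbers
lr = 0.1                                       # learning rate
for step in range(200):
    pred = X @ w
    loss = np.mean((pred - y) ** 2)            # quantified "unhappiness"
    grad = 2 * X.T @ (pred - y) / len(y)       # the slope, from calculus
    w -= lr * grad                             # move downhill, iteratively

print("learned weights:", w)                   # should be close to true_w
print("final loss:", loss)
```

No sampling or searching of paths is needed: the gradient tells the model the direction to move its parameters at every step, which is the point made above.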
00:17:46
Speaker
Okay. So you said parameters are numbers. So essentially, for any image or text which is generated, there are first some numbers which are generated, and then those numbers are converted into letters, groups of words or something like that. Let's say this is the knowledge, right? The weights are essentially an encapsulation of all that the model knows. So what we were talking about, when we have images and labels, is more of a so-called discriminative problem, where you have to discriminate.
00:18:18
Speaker
In probability, you call it y given x, p(y|x). But when we want to do generative learning, we model p(x, y), the joint distribution, not a conditional distribution.
00:18:33
Speaker
But yes, the knowledge, even if you want to generate an image, and all possible images that your model can generate, is encapsulated in these weights, in these parameters, in these numbers. But then there is a very complicated function on top of these. There are different layers, and each layer can do different things: it can be a transformer, a convolutional layer, a fully connected layer, a ReLU layer. These are essentially different functions which tap into these weights that we have learned, and then they can finally create an output. Okay, got it. So what is the big deal about Attention Is All You Need?
00:19:16
Speaker
So we'll have to go a little bit deeper into, let's say, the so-called sequence models. Earlier we had your RNNs, recurrent neural networks, or LSTMs, long short-term memory models. And what they did was: the model has a set of weights, the trainable parameters, and you give it the input, it uses the weights, it creates some output, and then it feeds that output back into the same model. And it keeps doing that. So, for example, you give it a sentence, and let's say it's at the word level, so it'll do that four or five times,

Transformers and the Future of AI Models

00:20:00
Speaker
and you have this representation, the output, and everything that we have given it is encoded in that output.
00:20:07
Speaker
You can call that output, let's say, a thought vector. That's essentially an encapsulation of that entire sequence into that one vector. And now you see the limitation: if you have a very large sequence, it's still contained in that one thought vector. It's also sequential, so it's Markovian: whatever the current output is depends only on my previous output, and so on.
00:20:32
Speaker
But with Attention Is All You Need... so people had been trying to fix this problem for a long time, but they couldn't find an efficient and elegant way to do it. What Vaswani et al. did was figure out a way to do it, using the modern transformers and the different norms that they have in this attention module. And what it means is, now you're not just looking at the last
00:20:59
Speaker
input, you're looking at all of them, and you have different weights for how much you want to look at your surrounding sequence tokens, sequence values. This is the major paradigm shift between the older sequence models and the newer attention-based sequence models. You can now give your attention differently based on how important everything in the local vicinity is, not just the previous one.
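A bare-bones sketch of the scaled dot-product attention at the heart of Attention Is All You Need: each token computes weights over every other token in the context and mixes their values accordingly. NumPy, single head, no learned projections, purely illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: every position attends to all positions."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # how much each token cares about each other token
    weights = softmax(scores, axis=-1)  # each row sums to 1: the attention distribution
    return weights @ V                  # weighted mix of all value vectors

# Five tokens with 16-dimensional representations (placeholder numbers).
tokens = np.random.rand(5, 16)
out = attention(tokens, tokens, tokens)  # self-attention over the whole context
print(out.shape)                         # (5, 16): one updated vector per token
```

Contrast with the RNN above: instead of squeezing the whole sequence into one thought vector, every output position looks directly at every input position with its own learned weighting.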
00:21:31
Speaker
Is this like increasing the context window, in very layman-ish language? Yeah, so that is now the context. Exactly, you're absolutely right. Whatever you're now giving it, whatever the model is able to see in this one shot, is the context of the model.
00:21:47
Speaker
Okay. And this made it possible for more natural-sounding conversation to happen, because it understood words within the context and it understood the history of the chat, like what was spoken in the previous sentence and so on. And therefore it allowed a conversational intelligence to emerge. Okay.
00:22:10
Speaker
Yeah, I mean, it just became way more powerful, right? And once you threw a lot of compute and big models at it, it could learn to do all these amazing things which we could never do before, because earlier our models were tiny, and they were trained on tiny data using tiny compute. So...
00:22:29
Speaker
Yeah, and then these emergent abilities were separate. We explicitly did not teach it to do certain things. Let's say you never taught it to do one plus one. But it saw so much text, and all it had to do was predict the next token, the next word.
00:22:46
Speaker
It kind of learned that whenever you do 1 plus 1, you get 2. So Ilya explains this. A lot of people would ask: oh, but if you train it in such a dumb way, of just outputting the next token, how do you expect it to do all these complicated things?
00:23:02
Speaker
And I love what Ilya said. He was, I think, being interviewed by the NVIDIA CEO. And what he said was: imagine we're reading a detective novel.
00:23:15
Speaker
And we reveal who the villain, the bad guy, is on the last page. If you're training a language model to do that, then for the language model to be able to predict the name of the person who was actually the villain, it has to understand that entire story.
00:23:32
Speaker
So you see where we're going with that, right? It actually gets to understand. The only way for it to predict what the name would be is to really understand the story. And this is where the understanding comes from.
00:23:45
Speaker
Okay. So does AI really understand, in the sense of how humans understand?
00:23:55
Speaker
Or is it just like a probability machine? Like, does AI have sentience in a way? Oh, very good question. So first we'd have to discuss how humans understand. And yeah, I agree, I don't think AI really understands in the same way, and there's the question of what understanding really is. But yes, LLMs are... my advisor, Yann LeCun, he calls them stochastic parrots, and I'm also
00:24:23
Speaker
from the same school of thought. Then there are these more modern models, the so-called world models. So, for example, will LLMs understand the laws of physics? Will they understand, you know, occlusion? Will they understand a bunch of these simple things that humans do, that babies do? I don't think they do. But we're trying to build these newer models, world models and so on, which will probably do this better. Generative video happens to be a very interesting, let's say intelligent, model, because with a generative video model you're again trying to let it finish the video, and if it learns that a ball has to drop, if it learns these laws of physics, then it has definitely learned something. It has some form of understanding.
00:25:14
Speaker
Okay, okay. So from BERT to ChatGPT, how did that happen? Yeah, so I think BERT to ChatGPT was mostly scaling laws.
00:25:25
Speaker
Scaling laws in terms of data, in terms of compute, in terms of the model size. These scaling laws, it was within OpenAI that they came up with them. There was GPT-2, and they kept at it, right? They believed in it, and then they took these giant leaps of faith where they put in a lot of resources, collected a lot more data, trained a lot bigger models, trained them for longer times on bigger clusters, and we got these amazing models.
00:25:57
Speaker
Okay, okay. So the scaling law basically says that if you give a lot of data and compute power, you will get intelligence, in a way. That is the scaling law.
00:26:08
Speaker
Yeah, so there was a paper, the Chinchilla scaling laws, and it showed a graph: okay, we are here, and if we give it more data and more compute, and the model size increases, we'll get to here, and you can then extrapolate that. But in my opinion, they're kind of breaking now, or it has almost been proven empirically that they're breaking. And after the scaling laws were not holding anymore, we started to get into test-time compute, which is your so-called reasoning models.
00:26:39
Speaker
Okay.

Scaling Laws and AGI: A Debate

00:26:40
Speaker
So you're saying that AGI, artificial general intelligence, is what OpenAI and Anthropic, all of them, are eventually aiming to reach. You're saying scaling laws will not get us there.
00:26:57
Speaker
Yeah, I think so. I think scaling laws have already stagnated, right? Ilya, who is the man behind the models at OpenAI, recently said at NeurIPS in 2024 that we have but one internet and we've used it all.
00:27:15
Speaker
So the data is already... And even if you had a lot more data... And people are paying a lot of money. Companies like Scale AI, Surge AI, these guys are in that business. But again, I think we have exhausted our data. And even if we have more data, I don't think those scaling laws are holding; it's becoming asymptotic in terms of performance.
00:27:39
Speaker
What does that word mean? Asymptotic means, you know, you have your hyperbola or a parabola, and you're approaching a line; it comes closer and closer, but the rate of coming close to it decreases. So it becomes more and more flat.
00:27:56
Speaker
Okay. So you're saying, and I think this is obviously something people can see also, that, say, GPT-4.5 versus GPT-5 is not so different. Yeah. Underwhelming is, I think, generally what people felt about the upgrade.
00:28:14
Speaker
So that is because scaling laws are not working. And scaling laws are not working because there's not enough data, or because the approach itself has limitations? The approach also, right? So the value for money that you're getting for the bigger model: scaling it from a 1 trillion to a 1.5 trillion or 2 trillion model is maybe getting you, let's say, 2% or 5% improvement. So even when you're doubling the model size, you're getting a very small improvement.
00:28:45
Speaker
Okay. So then what's on the cutting edge? You spoke of world models and you spoke of test time; just help me understand what these are. I haven't really heard of these two. Yeah, so test-time compute is essentially the reasoning model, right? You've trained your model, but now at test time, the model is in a loop and it's asking: okay, is this context good enough? So it's recreating certain parts of its context. It is...
00:29:17
Speaker
And it's continuously doing that until the model is happy, and it says: okay, now I think I have enough information and I'm ready to give the output. So the cost that you're paying is test-time compute, because at test time your output is slower and you're using more compute. But the results are better, because the model has the ability to improve what it first got. It's kind of a self-correcting mechanism.
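The loop being described can be sketched as a generate-critique-revise cycle. The `llm` function below is a hypothetical stand-in for any text-completion call, not a real API; the loop structure, not the model call, is the point.

```python
# Sketch of test-time compute as a generate/critique/revise loop.
# `llm` is a hypothetical placeholder for any chat-completion call.

def llm(prompt: str) -> str:
    # Placeholder so the sketch runs; swap in a real model call here.
    return "DONE" if "List any gaps" in prompt else "draft answer"

def answer_with_reasoning(question: str, max_rounds: int = 5) -> str:
    draft = llm(f"Answer this question: {question}")
    for _ in range(max_rounds):
        critique = llm(
            f"Question: {question}\nDraft answer: {draft}\n"
            "List any gaps or errors, or reply DONE if the draft is good.")
        if critique.strip() == "DONE":   # the model is "happy": stop spending compute
            break
        draft = llm(                      # spend more test-time compute on a revision
            f"Question: {question}\nDraft: {draft}\nCritique: {critique}\n"
            "Rewrite the answer fixing these issues.")
    return draft

print(answer_with_reasoning("Why is the sky blue?"))
```

Each extra round costs more inference compute and latency, which is exactly the trade described: slower output, better results.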
00:29:43
Speaker
This is, as a layman, when you turn on the thinking mode, yes, that thinking mode. If you actually watch the thinking happening, the model is talking to itself about what the user wants and what it should do. And it's also correcting itself and saying, no, actually this

Building AI: India's Progress and Key Players

00:30:00
Speaker
is what it is. So that is improving the intelligence of the model by giving it some time to talk to itself, in a way.
00:30:06
Speaker
Yeah. And correct itself and gather additional information. Okay, interesting. And what are world models? Yeah, there are different kinds of world models. World models are essentially models that somehow learn certain laws through, say, design. They're very new, and people are still trying to figure out the best way to do it, but one world model can just be, say, generative video. If you train a model for generating video, then you're learning a world model within that model. So these are the so-called world models. Okay, got it.
00:30:53
Speaker
Can you take me through the steps to build a foundational model? Say India doesn't have a foundational model as such; if India wanted to build one, what are the steps? Who are the stakeholders that you need to work with? Who are the players in this? You spoke of Scale AI as one of the stakeholders in this space. Just help me understand that space overall of building a foundational model.
00:31:21
Speaker
And a related question is: why are AI engineers getting $100 million salaries? Sure. So, the steps. India does have a few foundational AI models, from AI4Bharat and BharatGPT and so on. But yes, we're not there yet. The steps are: first you need to collect data.
00:31:49
Speaker
The good news here is that you don't need to label it in the beginning, because it's unsupervised, because it's so-called autoregressive, meaning all the model needs to do is predict the next word, and we already know what the next word is from the text corpus that we have collected. So we need to collect a lot of data, and then we need to clean it. Let's say we're talking about text data.
00:32:15
Speaker
So we collect a lot of text data from the internet, but then we need to clean it. We need to make sure that it doesn't have, let's say, HTML tags or some garbage,
00:32:31
Speaker
and you also need to make sure that it doesn't have malicious or inappropriate things in your dataset. So you have now, let's say, collected and cleaned your dataset.
00:32:46
Speaker
Then you do the so-called pre-training stage. In the pre-training stage, what you need to do is first build a giant cluster.
00:32:57
Speaker
And this cluster can span multiple buildings. And across these buildings, you will have, let's say... By cluster, you mean like a cluster of GPUs? A cluster of GPUs. Okay. Yeah. And you will connect these. So each node will have, let's say, eight GPUs, these big GPUs, H200s, with like 100 GB of VRAM, 100 GB of memory on that GPU.
00:33:23
Speaker
And then you will connect all these different nodes, each node with, let's say, 8 GPUs, with very high-bandwidth connections.
00:33:35
Speaker
And then, on this giant cluster, you will train a giant model. And this giant model will essentially predict what the next word is. This is the most expensive stage of training a foundational language model.
00:33:53
Speaker
And then you will keep doing that until... This is still unsupervised, this stage which you're talking about? Yeah, this stage is fully unsupervised, because all you need to do is predict the next word, predict the next token. And it's unsupervised because you already have this information in the data and you don't need any labels.
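Concretely, the "labels for free" point looks like this: for an autoregressive model, the target at each position is just the next token of the corpus itself. A toy example at the word level.

```python
# Autoregressive pre-training needs no human labels:
# the target at each position is simply the next token of the text itself.
corpus = "twinkle twinkle little star how i wonder what you are".split()

pairs = list(zip(corpus[:-1], corpus[1:]))   # (token seen, token to predict)
for seen, nxt in pairs:
    print(f"after {seen!r}, predict {nxt!r}")
```

Every document on the internet yields training pairs this way, which is why the pre-training stage can consume raw crawled text directly once it has been cleaned.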
00:34:12
Speaker
Are there companies which provide the data? Like you said, you need to collect and clean the data. Is that something a foundational model company does in-house, or can you...? Yeah, so I think they do a bunch of it in-house, and then they also reach out to a lot of other companies, including the Scale AIs and the Surge AIs of the world, to help them with the data.
00:34:38
Speaker
So a company like Scale AI would have crawled the web and would have all of that data on their servers, which they would provide to you. Yes. Yeah. I mean, I don't know if they do or not; I have personally not done this pre-training stage. But I would think so. Yeah.
00:34:56
Speaker
Okay, got it. So this is the pre-training stage. Then what next? Yeah, so after this pre-training stage, you have a model which is able to predict the next word. So if you start it off, if you say, "hi, how are you doing?" or you say, "twinkle, twinkle," it'll go on to create a poem and say, "little star, how I wonder what you are." And that is all it is capable of doing, because that is what we trained it to do. But that's not very interesting for us. I mean, it's interesting, but it's not very useful to us. So then you do the so-called supervised fine-tuning stage.
00:35:31
Speaker
And in this supervised fine-tuning stage, what we do is we have these prompt-context-output triplets, right? So the input, the context and the output.
00:35:43
Speaker
And of these you have not too many, let's say a few hundred thousand such data points. And you then train the same model, but now it gets to see the context, the question and the answer after that.
00:36:00
Speaker
So it now learns: okay, if I have this context and I'm given this input, I will have to give this output, because that is how it was trained. And now it is capable of answering questions, because it got to see all these different questions and corresponding answers. Even for a question that it has not seen, because through this supervised fine-tuning stage we changed the model to go in that answer direction, it will still be able to somewhat give us the answer.
00:36:36
Speaker
So that is the second stage. Okay, okay. This is the supervised learning stage. This is the supervised fine-tuning stage, and this is called the post-training. Post-training has two or three steps, right? So this is one step, and the other main step is essentially preference. We want to align the model. It's the alignment stage, and alignment can be done in different ways, usually through reinforcement learning.
00:37:05
Speaker
So the model will produce, let's say, a bunch of outputs, and an expert will choose which output they prefer. We will then train our model to go in that direction; we will align our model with our experts. And this is the most expensive part of training in terms of the dataset. In terms of compute, it is the pre-training, but in terms of dataset, it is this alignment stage, because we really need these experts to tell us which one they like. If you're dealing, let's say, with code, you need programmers. If you're dealing with stories, you need poets and writers. So it becomes very, very expensive and very tedious.
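As an illustration of the two post-training datasets just described: supervised fine-tuning uses prompt-context-output triplets, and alignment uses preference judgments over candidate outputs. The field names below are made up for clarity, not any lab's actual schema.

```python
# Illustrative shapes of post-training data; field names are hypothetical.

# Stage 2: supervised fine-tuning on prompt / context / output triplets.
sft_example = {
    "context": "You are a helpful assistant.",
    "prompt": "Summarize why the sky is blue in one sentence.",
    "output": "Air molecules scatter short blue wavelengths of sunlight "
              "more strongly than longer red ones.",
}

# Stage 3: alignment. The model proposes several outputs; an expert picks
# the preferred one, and training pushes the model toward it (RLHF-style).
preference_example = {
    "prompt": "Explain recursion to a 10-year-old.",
    "chosen": "It's like standing between two mirrors: the picture you see "
              "contains a smaller copy of itself, again and again.",
    "rejected": "Recursion is when a function invokes itself with reduced "
                "arguments until a base case terminates the call chain.",
}
```

This is why alignment data is so expensive: every `chosen`/`rejected` pair needs a qualified human judgment, where pre-training pairs come for free from raw text.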
00:37:53
Speaker
Okay. So this is all human effort, at least as of today. The prompt-context-output data, this is all human-generated stuff which we are feeding to the model.
00:38:05
Speaker
And even at the alignment stage, the experts... I'm guessing the companies wouldn't be hiring these experts, right? These would probably be gig workers or something like that. Yeah, very good question. So these companies, the data annotation people, the game they play is that they have a large network of these different freelancers. And these freelancers are
00:38:30
Speaker
coming in based on the gig, the problem that they want to solve. Companies like Turing, Scale AI, Surge AI. Surge AI, I think the company itself is just, say, 100 to 300 people, but they work with thousands of experts around the globe.
00:38:47
Speaker
Okay, okay. So the expert part is provided by these companies; you can plug into one of these companies to get the experts to help you with the alignment stage. Okay. What next after alignment?
00:38:58
Speaker
Yeah, that's it. After alignment, you might want to do some red teaming: testing the model, trying to break the model. That is the guardrail stage. And if, for some prompt, it gives an output that you don't want, you then create a bunch of those outputs and do the alignment again, or put some additional guardrails on top. It might be a cascaded approach, which the Gemini team has, for example, sometimes used, where they put this output into something else, and if that classifies it as malicious, it just says: oh, I'm not supposed to output this. Okay. So this reinforcement learning... there's a term called RLHF. What is the difference here?
00:39:49
Speaker
Yeah, so reinforcement learning can be done with different kinds of feedback. It can be done without human feedback. It can be done using a reward function. It can be done using a bigger model. But

Economics of AI: Salaries and Model Development

00:40:00
Speaker
here, because we use human feedback, because we use a human to say which one is better or worse, it's reinforcement learning with human feedback.
00:40:07
Speaker
Okay. So this was slightly older, and now DeepSeek made GRPO, but it's still with human feedback. And GRPO is what is the most popular and most efficient as of today.
00:40:20
Speaker
What is that? So GRPO is just another way. It still has these different outputs and you still choose one of them; in the details, it just does it slightly differently.
00:40:33
Speaker
But yeah, it's still human feedback and reinforcement learning. There are still humans who are giving feedback, but you need to give less feedback. That's how it's more efficient, I'm presuming. Yeah, that's correct. Okay, got it. Why do models have
00:40:50
Speaker
different personalities? Like, the personalities of Claude, Gemini, and ChatGPT are different, and you feel it. At what stage does that come in, the personality of the model?
00:41:07
Speaker
Yeah, so if you followed the explanation, I think it's pretty clear that it comes in at the last two stages. In pre-training, there is no personality; you're just building it up, encoding the internet's information in its weights. But then, based on what it is supervised
00:41:26
Speaker
fine-tuned on, or the alignment, it gets that personality. And it's a decision the company makes based on what they want it to look like. So the SFT stage essentially gives it a personality in terms of the kinds of words it will use, whether it will be verbose or curt, and then the alignment stage gives it a certain additional personality.
00:41:48
Speaker
Okay, got it. So why are AI engineers getting $100 million salaries? I think it's demand and supply at play. If you want people who have done the pre-training bit... the pre-training bit is so expensive in terms of the capex in GPUs, in terms of the data and so on, that if you're spending on the scale of hundreds of millions or billions of dollars in compute, why not pay a person $10 million or even $100 million, so that they do not make mistakes and they do it quickly? Because in
00:42:33
Speaker
games like these, it's often winner-take-all. So it makes sense to invest in those experts, and there are not so many of them. So there is that demand and supply. Plus, it's still marginal when compared to the amount of money you're spending on compute.
00:42:52
Speaker
Essentially, the cost of a mistake is in the millions, and the cost of wasting time is also in the millions, because you're paying for power and GPUs, et cetera. And opportunity cost, right? What if your competitor does it before you? Then you're doomed.
00:43:08
Speaker
Okay, now I understand. So, is compute

Hardware Limitations in AI

00:43:20
Speaker
and power a limitation today for the AI industry to grow further?
00:43:26
Speaker
Like, how many GPUs are available, or how much power is available? Is that really a limitation? I want to understand that a bit better. Yeah, I think so. I mean, I'm not really sure of the economics, like the stockpile that NVIDIA has or the TPUs that Google has.
00:43:47
Speaker
But it's definitely limited. And I think they're trying to catch up very quickly. But given the ambitions that different people have, and the people who still believe, let's say, in the scaling laws, they want to try these bigger and bigger models. So I don't think there is
00:44:04
Speaker
enough of that. But at the same time, it's expensive, right? Companies are laying off people so that they can put that money into capex.
00:44:15
Speaker
And NVIDIA's stock price just keeps going higher and higher. Is the power and compute needed for the training or for the inference? So inference is essentially the consumption of the model; like, if I'm asking ChatGPT to make a LinkedIn post for me, that would be called inference, right? Right, exactly. Where is it needed,
00:44:41
Speaker
at which of these two? Yeah, it depends, right? Of course, training is super expensive. For training, if you need those 10,000 GPUs, you need those 10,000 GPUs; you cannot do it with less than that, because you're training this giant model. But when you're deploying it, you need less compute. You can choose how many
00:45:07
Speaker
customers you want to serve, let's say concurrent customers. You need more compute at inference if you're serving a lot of customers. And I think that is where people are actually deploying their capex. You need, of course, a lot of compute for training, but then you probably need even more for inference if you have a lot of customers. People who don't have a lot of customers need less compute.
00:45:29
Speaker
Okay, got it. What's the difference between a large language model and a small language model? Yeah, I mean, both are just language models, and a large language model, as the name suggests, is just large: let's say 500 billion parameters, one trillion parameters. Your small language model can be, let's say, a billion parameters, which is still pretty large if you compare it to older models. Llama 1B or Qwen 4B, you can call them small, or the Microsoft Phi models,
00:46:05
Speaker
like Phi-3, around 3 billion parameters. When you contrast these with 500 billion or 1 trillion parameter models, they are smaller language models. Yeah, so is ChatGPT a 1 trillion parameter model? We don't know, but, I mean, there is the Kimi K2, which just came out today and is beating the benchmarks, and it's a trillion-parameter model. Okay, amazing.
00:46:46
Speaker
Yeah, for efficiency. Yeah, you can use that large model, but then you pay more money. If you have a specialized task, you can maybe distill that large model into a smaller model, because it has to do just that one task and do it well. So it'll be way cheaper to run that model; you can maybe run it even on your own GPU. For that reason, for efficiency, you want to use small language models whenever possible.
00:47:15
Speaker
So if I were building something which needed AI in it, one option for me would be to take an API from OpenAI and pay based on token consumption. The second option, if it is a narrow use case, would be to create my own small language model. So you're saying the cost of training and deploying your own small language model, and the cost of the engineering talent you need to create and deploy it, is worth it.
00:47:47
Speaker
It will still be cheaper than what I might pay to OpenAI. Again, it depends on your scale, depends on your margins. But yeah, if you are serving a lot of customers and you are able to have enough margin, then in the long run, it's basically a scale game, right? If you're serving millions, hundreds of millions, then 100%, it will be way more efficient.
00:48:07
Speaker
What would be an example of a company where it would make sense to make their own small language model? Let's say a company that does something very specific, such as analyzing invoices or analyzing circuit diagrams.
00:48:30
Speaker
Or analyzing, let's say, Jira tickets. So analyzing something specific, and lots of it. That would be one example of a use case where a small language model would make sense.
00:48:47
Speaker
But that's what I'm struggling to understand. Why would analyzing invoices have a "lots of it" element to it? Lots of, sorry, what? Like, why would there be
00:48:59
Speaker
a quantity element to something as narrow as analyzing invoices? And what is the quantity we're talking about? Are we talking of 100,000 invoices, or 10,000 invoices? At what quantity would it be better to make your own model? Yeah, so for example, you might spend $1 for a million tokens. And one page has, let's say, 600 tokens.
00:49:30
Speaker
One page at, say, 10-point font size, right? And input tokens are cheaper; output tokens are more expensive. Output tokens can be like $10 to $15 for those 1 million tokens.
00:49:42
Speaker
Now, I think it'll get interesting for you if you're processing something like 100,000 invoices. So, for example, if Zoho is doing things with invoices, probably they're training their own SLMs.
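Running the speaker's numbers as a rough back-of-the-envelope, under the stated assumptions ($1 per million input tokens, $10 to $15 per million output tokens, roughly 600 tokens per page); the output-token count per invoice is an added assumption for illustration.

```python
# Back-of-the-envelope API cost for 100,000 one-page invoices, using the
# illustrative prices quoted above (assumptions, not a real rate card).
invoices = 100_000
tokens_per_page = 600              # ~one page of 10-point text, as stated above
input_price = 1.0 / 1_000_000      # $ per input token
output_price = 12.5 / 1_000_000    # $ per output token (midpoint of $10-15)
output_tokens_per_invoice = 300    # assumed: a structured summary per invoice

input_cost = invoices * tokens_per_page * input_price
output_cost = invoices * output_tokens_per_invoice * output_price
total = input_cost + output_cost
print(f"input: ${input_cost:.0f}, output: ${output_cost:.0f}, total: ${total:.0f}")
```

At these assumed rates the batch costs a few hundred dollars, and it recurs with every batch, which is where a self-hosted small model fine-tuned for invoices starts to amortize.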
00:49:56
Speaker
Oh, okay. Got it, got it. So companies which

AI in Industry: Case Studies and Impact

00:50:00
Speaker
are selling a SaaS solution which has AI in it, for them it might be cheaper to have a small language model, because they would have thousands of customers. So it makes sense for them to make that investment. Okay, understood. Yeah. For example, Zoho, I don't know if your listeners would have used it, but it has a feature for taking a photo of your receipt for reimbursements. If they sent that image to OpenAI, it would cost them like 4 to 5 rupees. And Zoho gives you a seat for 250 rupees a month.
00:50:38
Speaker
So it would not scale, even at like 50 invoices. If one person does 50 invoices or 50 receipts a month, they are not profitable. So it is their own model, and this is why they can give it to us even for 250 rupees per seat.
00:50:54
Speaker
Okay, fascinating. Understood. So let's go through your journey a little bit more. When did you actually start working on AI? You told me about a Spielberg movie, the Tintin movie. What was the role of AI in that movie?
00:51:16
Speaker
That was your first experience of working with AI? No. So what happened was, after my bachelor's, I was in the industry for two years. I was with Yahoo; it was not very interesting. So I packed my bags and I spent a year in Italy. And that is when I was first exposed to computer vision,
00:51:42
Speaker
images, and that is where I was first exposed to this so-called machine learning. My task was essentially to identify logos in, say, F1 videos. So you have broadcast video of an F1 race, and the advertiser wants to measure the ROI. They want to measure how many times my Alice logo or my, you know,
00:52:11
Speaker
TIM or whatever brand logo was visible. And to do this, because you cannot do it in a heuristic way, we resorted to these data-driven ways. This is where I was first exposed to support vector machines and the SIFT descriptor and the HOG descriptor. And it kind of fascinated me. But then, when I started my PhD at the Max Planck Institute in Germany, at Saarland University, it was apparent that machine learning was going to be the next big thing. This was 2008, and all my friends wanted to get into machine learning, different forms of machine learning. And I,
00:52:49
Speaker
being, you know, a bit visual, I found computer graphics to be very interesting. And my advisor also wanted to do things with machine learning. So I found the best of both worlds, where I married machine learning with computer graphics. My thesis was essentially the Gen AI of 2012. The title of my thesis was Data-Driven Algorithms for Interactive Visual Content Creation and Manipulation. I did things with videos, with 3D models, with material properties of models. For example, I did a project where the video on YouTube has a million views. It's called Movie Reshape: given a video of a person, we can make that person look taller, shorter, or more muscular in that video.
00:53:45
Speaker
I have a Baywatch video where this person is running, and we make that person look more muscular. And this was all data-driven. Similarly with 3D models: if you have two models, we try to interpolate between them, we try to extrapolate between them. We try to, say, extrapolate between a boat and an aeroplane, and the hope is that it looks like a fast-moving boat.
00:54:12
Speaker
And then, 3D modeling also has material properties. Once you have modeled a car, you need to assign material properties: to, say, the windshield, which is transparent and glossy; to the tires, which are diffuse and black; and specular for the body, for the metallic finish, and so on. So we again had a database of objects, and based on searching for the object type, we were able to assign different kinds of materials to these different parts.
00:54:49
Speaker
So these are the kinds of problems I explored. And what happened was, because this video went viral, Peter Jackson looked at it and was like: oh, we need to get this kid here. There was already a relationship between Weta Digital in Wellington and our institute; one of my seniors was there. So he pinged me: do you want to come here? I'm like, yeah, obviously. So I spent six months there. They were doing The Hobbit, and Peter Jackson thought, okay, we'll try it for The Hobbit, but it was already too late for The Hobbit. But then they were working on Tintin, and I happened to be doing a lot of human pose estimation, and there is mocap. So I
00:55:28
Speaker
was, you know, analyzing animation by putting in dots, those infrared markers on the body, then extracting the animation, rigging another model, and transferring this animation onto that model. This is mocap, motion capture. And it was working well for two-legged creatures, but for four-legged creatures there were problems of foot skating and whatnot. And then I helped do this for Snowy.
00:55:56
Speaker
Okay, amazing. Then what next? So yeah, it was then very clear that machine learning was going to be the thing. I joined NYU, and this was just before AlexNet came out. So Yann was still not famous, and he happened to be my next-door office neighbor. I was working with Chris Bregler, and Chris had just sold his company, and Yann was looking for people to work with. So I started to work with Yann, and AlexNet happened, and then we knew that, okay, now the rocket ship has taken off. And it was a very interesting time. People in my lab went on to create

AI in Entertainment: Success Stories

00:56:40
Speaker
OpenAI. There was Wojciech Zaremba at NYU at the time, and Wojciech is one of the founders of OpenAI.
00:56:46
Speaker
There was Matt Zeiler, who won the ImageNet competition the next year using his ZFNet, the Zeiler-Fergus network. And everybody was doing incredible things. Some people I did my PhD with — Mateusz, for example, is now a founder of Moonvalley. Moonvalley is a company doing generative video; I don't know if you've heard of them. They use only licensed data, they are very friendly to creators, they are doing things the right way, they've raised a bunch of money, and they are training these large-scale models. So this was an interesting time, because we knew it was the time to apply deep learning to X: you just chose a problem and applied it, and the only question was how quickly you could. It was very rewarding. And then all these different companies came looking for us, and we could call the shots. The salaries then and now you cannot match — of course, for the superintelligence labs it's different — but the PhDs and postdocs graduating now don't get that kind of
00:58:00
Speaker
markup on what we got at that time. So we had a bunch of companies looking for us, and I chose Apple for certain reasons: they were building this autonomous driving division and they were giving me some very interesting work. So I joined them. But at the same time, I had this IP outside of NYU, in the startup Perceptive Code LLC. And then Mercedes came looking for us, because we had published this NeurIPS 2014 paper on estimating human pose using heat maps — we were the first people to do that — and Mercedes wanted to use this within the cockpit of the car.
00:58:36
Speaker
So they wanted to install a few cameras inside the car, and based on that, they wanted to realize certain use cases. One of my favorites: in the EQS, we removed the left-right button with which you choose which side mirror to adjust.
00:58:51
Speaker
Just by looking at the mirror, we know which mirror you want to adjust, and you then only need to move the joystick. So there is no left-right button in that EQS. Very cool. Okay.
00:59:03
Speaker
Yeah. So because of that, they came looking for us. We did some tests for them, one thing led to another, and I decided to sell that IP to them. It also gave me the possibility to come back to India, because we could do the handover sitting either in San Jose or in Bangalore — that is where they were building these two teams. I chose to come back to Bangalore, and here I am. You didn't want to continue working at Apple? Were you able to see the signs that Apple would not actually go ahead with the autonomous driving car? Yeah, very interesting question. I think I did — I saw the very early shoots of that happening. But at the same time, the opportunities then were many, and I always wanted to come back to India. And I think I'm a very lousy employee. So for that reason I had two options: either I could be an entrepreneur, or

The Quest for Self-Driving Cars: Challenges and Approaches

01:00:00
Speaker
I could be a professor — I could be in academia. And I did dance with academia: I was a professor at IIT Bombay, I graduated PhD students, and it was an amazing time. But you still have a lot of opportunity cost; you leave a bunch of things on the table. So for that reason, I do what I do.
01:00:21
Speaker
Why did Apple shut down that project? The self-driving car project? Let's say it was just not working out. I think they realized they would not be able to get where they wanted to. The Waymos and Cruises of the world were far ahead of them, the Teslas of the world were far ahead of them. So they decided to call it sunk cost and move on.
01:00:49
Speaker
Okay. What's the challenging thing about building self-driving? As an outsider, I probably don't appreciate why it is so hard. Can you help me understand?
01:01:02
Speaker
Yeah. So self-driving is really, really complex, even though to us humans driving comes so naturally. Firstly, it's safety-critical: you just cannot make mistakes, and if you do, you will probably need to shut down. That is exactly what happened to Uber's self-driving division and, more recently, to Cruise's self-driving division.
01:01:28
Speaker
And the story of why the Cruise car met with an accident is again very interesting, but maybe we'll cover that later. Self-driving involves a lot of nuance. There can be people of all kinds wearing all kinds of clothes — the variability of the world is just too much. And it's a very real-time application: you may have 16 sensors, and you have to decide within a budget, say 100 milliseconds, for the car to be able to make decisions. At the same time you're doing high speeds, there is a lot at stake, and the world is constantly changing and very chaotic.
01:02:14
Speaker
So it's really a very hard problem. And all of this compute is happening on the car itself? Exactly, I wanted to mention that. You have to do all that compute on the car, so you're limited by the amount of compute you can carry, both for energy reasons and for space.
01:02:35
Speaker
And also the price of the car, right? Because with what Waymo is doing, they do not know when they will be profitable — it's nowhere on the horizon.
01:02:46
Speaker
But because they are where they are, they're continuing to do it. And they have deep pockets. How are companies solving these problems? So there are two ways to solve this. One is the so-called modular approach, and the other is the end-to-end approach. The newer companies, especially out of China, are doing the end-to-end approach. In the end-to-end approach, you take your sensor inputs as inputs — say, 16 different cameras, maybe the GPS coordinates, etc. — and feed that into your model.
01:03:21
Speaker
And as output, you simply get how much you should turn the steering, how much gas you should press, and so on. So that is the end-to-end approach. The modular approach is where you break it down into various modules, starting from perception,
01:03:37
Speaker
to scene representation, to prediction, and so on — let me take them in order. There is perception; after perception comes localization. Perception essentially means to perceive.
01:03:56
Speaker
You need to perceive the world around you. From your sensors — your LiDAR, your image, your stereo camera — you need to know: there is a car here, there is a traffic sign here, there is the road, there is a tree there.
01:04:12
Speaker
So that is perception: to perceive the world. After perception comes localization. You've perceived others, but you also need to localize yourself in the world. And you can do this — what people usually do for, say, centimeter-level localization is use these so-called LiDAR maps, these HD maps. There are companies, for example HERE Maps or
01:04:42
Speaker
Google Maps, that drive through streets with LiDAR rigs and calculate a 3D map of an entire city or neighborhood. Now you have a rough location of where you are using, say, GPS — meter-level accuracy.
01:05:00
Speaker
You place your car there in the world, and then you do a centimeter-level alignment by aligning the observations from your current LiDAR to that world you calculated and preserved earlier — that HD map. So that is the second step, localization.
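To make that alignment step concrete, here is a toy, ICP-style sketch of the idea: snap the live scan onto the stored HD map, starting from the rough GPS guess. All data and numbers are invented for illustration; a real localizer estimates full 3D pose and orientation, not just a 2D offset.

```python
# Toy sketch of centimeter-level localization: refine a rough GPS position
# by iteratively aligning the live LiDAR scan to a stored HD map.
# Assumes the GPS guess is close enough for nearest neighbours to be meaningful.
import numpy as np
from scipy.spatial import cKDTree

def refine_pose(hd_map_pts, scan_pts, rough_xy, iters=20):
    """Nudge the rough pose until the scan sits on the map (translation only)."""
    offset = np.asarray(rough_xy, dtype=float)
    tree = cKDTree(hd_map_pts)               # pre-built map of the neighbourhood
    for _ in range(iters):
        placed = scan_pts + offset           # place the scan at the current guess
        _, idx = tree.query(placed)          # nearest map point per scan point
        residual = hd_map_pts[idx] - placed  # how far off each correspondence is
        offset += residual.mean(axis=0)      # least-squares step for translation
    return offset                            # refined position estimate

# Illustrative usage with synthetic data: the true offset is (2.0, -1.5) m.
rng = np.random.default_rng(0)
hd_map = rng.uniform(0, 50, size=(2000, 2))
scan = hd_map[:300] - np.array([2.0, -1.5]) + rng.normal(0, 0.02, (300, 2))
print(refine_pose(hd_map, scan, rough_xy=[1.8, -1.3]))  # ~ [2.0, -1.5]
```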
01:05:23
Speaker
The third is scene representation. You have perceived others and localized yourself, and now you need to represent all of this in some common 3D space: where your car is, where the different things are — now you have a coherent world. The next step after that is so-called prediction.
01:05:45
Speaker
You need to predict where others will be. You might ask why, and the answer is that it's a very fast-moving environment, but also a somewhat deterministic one, especially on highways: the expectation is that a car going straight will continue to go straight. So you can extrapolate, and you want to extrapolate for the next few hundred milliseconds. That is the prediction stage, and then comes the planning stage. Now that you've predicted where they will be and you know where you are, you need to plan a path based on certain conditions,
01:06:26
Speaker
certain parameters. Say you have a destination, a goal, and you need to get there. But at the same time you cannot exceed 200 kilometers per hour, you don't want the ride to be jerky, and you don't want to hit others. So you have different choices of paths; you assign a cost to each of these paths and you choose one of them.
01:06:46
Speaker
And finally, once you've chosen a path, you have to faithfully follow it. That is the control step, where you turn the chosen path into control commands. These are the different modules of the modular pipeline, and there are pros and cons between end-to-end and modular.
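As a map of those six stages, here is a schematic sketch of one perception-to-control cycle. Every name in it is illustrative; only plan() is fleshed out, to show the cost-per-candidate-path idea he describes.

```python
# Schematic skeleton of the modular self-driving pipeline: six stages.
# All names are illustrative; real stacks are enormously larger.

def perceive(sensor_frames):            # 1. detect cars, signs, pedestrians
    return []                           #    -> list of detected objects

def localize(lidar_scan, hd_map):       # 2. centimeter-level ego pose
    return (0.0, 0.0, 0.0)              #    -> (x, y, heading)

def represent(objects, ego_pose):       # 3. one common 3D scene
    return {"objects": objects, "ego": ego_pose}

def predict(scene, horizon_ms=300):     # 4. extrapolate everyone's motion
    return scene                        #    (cars going straight keep going)

def plan(scene, candidate_paths):       # 5. pick the cheapest candidate path
    def cost(p):                        #    weighted: comfort, safety, speed
        return 2.0 * p["jerk"] + 10.0 * p["collision_risk"] + p["travel_time"]
    return min(candidate_paths, key=cost)

def control(path):                      # 6. turn the path into actuation
    return {"steering": 0.0, "throttle": 0.0}

def tick(sensor_frames, lidar_scan, hd_map, candidate_paths):
    """One ~100 ms cycle: sensors in, actuator commands out. Because each
    stage's output is inspectable, separate teams can own and debug stages
    independently -- the modular approach's big advantage over end-to-end."""
    scene = represent(perceive(sensor_frames), localize(lidar_scan, hd_map))
    return control(plan(predict(scene), candidate_paths))
```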
01:07:09
Speaker
Okay, so it's not a settled debate on which is the superior approach? Yeah, people have opinions — I have mine — but I don't think it's settled yet. So Waymo, for example... Sorry?
01:07:24
Speaker
What's your opinion? Yeah, I would side with the Chinese, I think — end-to-end is the way to go. Because, like you said, it's so complex that even breaking it down into modules and then stitching them together with heuristics into a great engineering solution is just not scaling.
01:07:47
Speaker
Even when you can do it for, say, one neighborhood or one city, scaling it to a different city is a completely different ballgame. Companies like Wayve from Cambridge claim: we may not be the first people to solve it, but we will be the first people to scale it to 100 cities. And I think scaling is easier in the end-to-end approach.
01:08:07
Speaker
But there are advantages in the modular approach, because you can debug it better and organize teams around it better. You can debug at the interface of every module: you know the output of perception, of localization, of scene representation, of planning, of control. You cannot do that if it's end-to-end, so debugging end-to-end is harder.
01:08:29
Speaker
Also, with end-to-end you cannot have these multiple teams, so the engineering is harder. But it is also simpler in the sense that you don't have to do any of that: you treat it as a black box and hope that you're putting enough compute and enough data into training this model.
01:08:46
Speaker
Okay. In the modular approach, the car actually has to carry the LiDAR map data for each city inside — you need a lot of local storage for it, etc.
01:08:59
Speaker
But in the end-to-end approach, it is essentially like an LLM in a way — a model which has been trained. It depends; different people are doing different types of end-to-end approaches. But yes, one end-to-end approach could be where you do not have any LiDAR, and just based on your perception — your different vision cameras, 2D cameras — you plan the next course of action.
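For contrast with the modular sketch above, here is a toy PyTorch version of that camera-only, end-to-end idea: pixels in, steering and throttle out, with no hand-built modules in between. The architecture and sizes are invented, not any company's actual network.

```python
# Toy end-to-end driving network: camera frames in, controls out.
# Architecture and dimensions are invented for illustration only.
import torch
import torch.nn as nn

class TinyEndToEndDriver(nn.Module):
    def __init__(self, n_cameras=6):
        super().__init__()
        self.backbone = nn.Sequential(      # shared per-camera feature extractor
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(          # fuse all cameras -> controls
            nn.Linear(32 * n_cameras, 64), nn.ReLU(),
            nn.Linear(64, 2),               # [steering angle, throttle]
        )

    def forward(self, frames):              # frames: (batch, n_cams, 3, H, W)
        b, n, c, h, w = frames.shape
        feats = self.backbone(frames.reshape(b * n, c, h, w)).reshape(b, -1)
        return self.head(feats)             # no perception/planning stages:
                                            # the whole pipeline is one black box

model = TinyEndToEndDriver()
controls = model(torch.randn(1, 6, 3, 128, 256))   # one 6-camera frame bundle
print(controls.shape)  # torch.Size([1, 2]): steering + throttle
```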
01:09:26
Speaker
Okay, interesting. Currently, how many companies are trying to do self-driving on their own? Would a Mercedes be trying to do self-driving on its own? Because eventually, I think all companies see that as the future, but some might be doing it in-house and some might be working with third parties to adopt that technology.
01:09:50
Speaker
Yeah, so there is self-driving — say, level 4 autonomy — but then there are level 3 and level

FastCode's Vision and Growth

01:10:00
Speaker
2 and level 1. Level 4 is where you do not need a backup driver, you do not even need a steering wheel in the car, and the car is capable of driving within a geofence. Level 5 is the dream: it's like humans — it can just go anywhere there's a road. But level 4 is the so-called robotaxi, where it's a full-fledged robot that comes and picks you up, and at that point you do not really need to own a car, because you can summon it whenever you want. But
01:10:36
Speaker
even if companies are not looking at level 4 autonomy — and by the way, Mercedes has tried, tried very hard, and failed at it. A few years ago there was Project Athena. It was a joint venture — you can look it up, it's public information — and I was a part of the team. It was a very interesting time. Project Athena was a joint venture between Mercedes and Bosch; it burnt a few billion dollars, and then they decided to call it sunk cost and move on.
01:11:03
Speaker
So maybe not everybody is looking at level 4 autonomy, but everybody is definitely looking at level 3. Level 3 is where you can take your hands off the steering wheel and your eyes off the road. But then the car has to know when to wake the driver up, and at the same time it needs to know whether the driver is capable of being woken up — whether the driver is alive, whether the driver is alert — and the car has, say, a minute to give control back to the driver, so it needs to know when it must do that. And by the way, in Tokyo there is Wayve doing a bunch of tests, and other companies are testing too — these tests are happening all around.
01:11:53
Speaker
Okay, fascinating. How far behind is India? Does India have a chance to be one of the leaders in self-driving? I don't think there's much happening in India in terms of self-driving.
01:12:10
Speaker
Yeah, in India there are smaller players, but none of the big players are putting their weight behind these futuristic, deep-pocket technologies. I think we're still on the scarcity side of things, not abundance, and we're not taking these big risky bets just yet.
01:12:38
Speaker
I sometimes joke that maybe it's easier to do it in India, because we just have to decide if there's something in front of us: go a little bit, then brake; go a little bit, then brake. Maybe no lane markings to deal with, and so on. So maybe it's easier — but I'm just joking and being nasty. Okay, fewer rules to follow in India, so therefore easier — got it. But jokes apart, India is really hard, right? It's very chaotic: there are two-wheelers, there are animals, stray dogs, cows. It's really, really complicated and chaotic, so India is definitely harder than other geographies. The easiest setting is a highway, because highway driving is more deterministic, but Indian highways are still not as deterministic. And it's all driven by economics: if there were enough demand, if people really wanted it, you could still have it. But human labor is still not as expensive as in the developed world for people to invest in that kind of technology in a big way in India.
01:13:46
Speaker
Okay, interesting. So once you left Apple and came back, what next? Yeah, so we had a handover to do. The model that we brought from Perceptive Code was occupying 6 GB of GPU memory, and we had to fit it into a small 2W Xilinx ZU3A FPGA with 110 DSPs. Eventually we had 300 KB of weights — so from 6 GB down to 300 KB. That journey itself took about three years.
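Two standard ingredients of that kind of squeeze are quantization (fewer bits per weight) and pruning (fewer weights). The toy sketch below shows only the basic arithmetic of each; it is emphatically not the actual Perceptive Code/FPGA pipeline, which he describes as a three-year effort.

```python
# Toy illustration of two ingredients of a 6 GB -> 300 KB model squeeze:
# weight quantization and magnitude pruning. Illustrative arithmetic only.
import numpy as np

w = np.random.randn(1024, 1024).astype(np.float32)   # 4 MB of fp32 weights

# 8-bit quantization: store int8 values plus one scale instead of fp32
scale = np.abs(w).max() / 127.0
w_q = np.round(w / scale).astype(np.int8)            # 1 MB -> 4x smaller
w_deq = w_q.astype(np.float32) * scale               # dequantized for use

# magnitude pruning: drop the 90% smallest weights, keep a sparse remainder
threshold = np.quantile(np.abs(w), 0.90)
mask = np.abs(w) >= threshold                        # ~10% of weights survive

print(f"mean quantization error: {np.abs(w - w_deq).mean():.5f}")
print(f"kept {mask.mean():.0%} of weights")
```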
01:14:19
Speaker
And once we did this — because our acquisition was tied to certain milestones, and the final milestone was the model being released in the car — we hit that in 2018. After having sold the company, I was not sure what to do, so I took some time to figure that out. In the meantime I was teaching and doing things by myself. Then in 2020 COVID happened, and by then I had enough energy to do my next thing.
01:14:53
Speaker
I started doing consultancy in a very natural way, because people kept calling me: Arjun, can you help me with this? Can you help me with that? And at the same time, a lot of bright people — people I was teaching, or knew otherwise — kept asking: do you have interesting things to work on?
01:15:07
Speaker
And I realized that I can only scale my time so much, no matter how expensive I make it — but I can still scale by selling other people's time. I also realized that I was good at identifying talent, attracting talent, nurturing talent, retaining talent. So I started to do this in a small way.
01:15:31
Speaker
But I think we really took off two years ago. I turned 40 two years ago, I had this midlife moment hitting me, and I channeled all that energy into growing FastCode. That is when we grew from a six-to-eight-member team to the 45-member team we are today. We grew about 8x in the last two years, and maybe just 1.5x before that, from 2020 to 2023. What do you do at FastCode?
01:16:03
Speaker
Yeah, so at FastCode we solve different problems for enterprises in speech, text, and vision. We call ourselves a platform-led, AI-native company: we have a few platforms, and we bring them to our enterprise customers so they can hit the ground running — so we can enable these AI use cases for them in a very efficient and quick way.
01:16:38
Speaker
And our core values are excellence, integrity, and innovation. Excellence for us is: how you do anything is how you do everything. If we do something, we do it really well, or we don't do it, because life is too short. Integrity is really important to us — I think it resonates with everyone at FastCode because it leads to long-term relationships, to clear communication, and to lower stress. And innovation is something we hire for: we create the conditions for it, but it's a creative process — it happens when it happens. It's our North Star, and we celebrate when it happens. So these are our three core values, and we work with our customers on their enterprise problems.
01:17:25
Speaker
So we are a horizontal company: we have this data platform, and now we have this reinforcement learning pipeline. We build custom environments for our enterprise clients, and then we reuse our reinforcement learning pipeline to solve problems that can be solved with reinforcement learning agents.
01:17:49
Speaker
So when you say you have a data platform, what does that mean? Yeah, we've learned the best practices from autonomous driving, and we now bring those learnings to all sorts of data modalities — insurance data, claims data, speech data, any kind of data you might have: structured, unstructured, documents. The data platform starts with data strategy and readiness: we have a framework for identifying how ready a company is for the AI journey, and if it's not ready, what it needs to do to get ready. How do we version the data? How do we build it such that we can always find provenance — where it came from, tracing its roots?
01:18:46
Speaker
We have audit logs and so on. And then the data loop: how you enrich this data, how you make it queryable, how you label it efficiently using auto-annotation. And the next step is auto QC — automatic quality checks. That is our data platform, all the way from data strategy to auto QC.
01:19:11
Speaker
Okay. So essentially, for companies that want to go on the AI journey, it starts with data first; that data will then train an agent or an SLM to help in execution. So that is the data platform. And then you said you have a reinforcement learning platform — what does that mean?
01:19:34
Speaker
Yeah. So an agent is essentially an LLM with tools. The question is: what do you need to do to make this agent do well at your task? You can just hand-code it — write a bunch of prompts for your LLM so it can do a certain task — but it's better if you can train this LLM using your data. And you cannot unleash it in the wild, in the real world, on day one. So you need to create a simulation, an environment. For a procurement setup, that environment may look like this: the LLM has access to different tools, such as write email, read email, search email, write a Slack message, read a Slack message. Then we create different personas — the different vendors, and other personas which are again just LLMs, say the different departments within the organization. These large enterprises usually already have a procurement setup, so we copy that setup into our environment, our sandbox. And instead of humans using that procurement software, we get these LLMs to use it — an LLM goes and clicks a few buttons — and we rig them up with prompts and personas so that the LLMs replicate what humans would be doing. We try to keep it current, and we try to improve the persona of every actor in this environment by learning from what happened in the past at the company.
01:21:27
Speaker
Right. And then they start to play a game. The department sends out an RFP, the different vendors bid on it, and an email conversation starts. They send, say, a quotation; the quotation has different terms and conditions; the department goes and says, oh, you haven't catered for these conditions; they send another email, and so on. The goal is to be able to do this automatically. When it has trained enough over time, we slowly bring it back into the real world in a safe way. For example, it might create a draft message — a draft Slack message — and a human has to vet it and send it. And if the human made changes to it, we record that, bring it back into training, and continuously learn and improve the agent.
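A heavily simplified sketch of what such a self-play sandbox could look like is below. Every name — the personas, the tool list, the reward — is invented for illustration; the point is only the shape of the loop: LLM-backed actors exchanging messages inside a copied procurement setup, with money saved as the training signal.

```python
# Simplified sketch of a self-play procurement sandbox. All names and the
# reward design are invented; real environments and policies are far richer.
import random

TOOLS = ["read_email", "write_email", "read_slack", "write_slack"]

class Persona:
    """An LLM-backed actor (a vendor, a department) with a role prompt."""
    def __init__(self, role):
        self.role = role
    def act(self, inbox):
        # in reality: call a fine-tuned open-weight LLM with tool access
        return {"tool": random.choice(TOOLS), "text": f"{self.role}: reply"}

def negotiation_episode(agent, vendors, max_turns=20):
    """One simulated RFP: the department agent vs. vendor personas."""
    transcript, price = [], 100_000.0
    for _ in range(max_turns):
        transcript.append(agent.act(transcript))
        for v in vendors:
            transcript.append(v.act(transcript))
        price *= random.uniform(0.97, 1.0)   # stand-in for real concessions
    return transcript, 100_000.0 - price     # reward: money saved

# self-play loop: many episodes, then periodic (e.g. nightly) policy updates
agent = Persona("procurement-agent")
vendors = [Persona(f"vendor-{i}") for i in range(3)]
for _ in range(1000):
    transcript, saved = negotiation_episode(agent, vendors)
    # in reality: feed (transcript, saved) into the RL update step
```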
01:22:15
Speaker
Okay. So you are training an existing LLM here — like, say, a ChatGPT would be getting trained in your reinforcement learning platform?
01:22:26
Speaker
We cannot train ChatGPT, because ChatGPT is a closed-source model, but we train various open-source LLMs — say, the Qwen 4B or DeepSeek kinds of models.
01:22:39
Speaker
What is the difference between a closed-source and an open-source model? A closed-source model is one where you do not have access to the weights — the learned parameters that encapsulate the knowledge. An open-source, or rather open-weight, model is one where they have released the weights, released the model. You have access to that 4-billion-parameter model, or that 1-trillion-parameter model; you can download it from the internet, and if you have the compute, you can even run it on your own computer.
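In practice, "download it and run it" looks roughly like this with the Hugging Face transformers library; the model ID is just one example of an open-weight checkpoint, and you still need a GPU with enough memory, as he explains next.

```python
# Running an open-weight model locally, e.g. a ~3-4B-parameter Qwen checkpoint.
# Weights are downloaded automatically on first use; a capable GPU is assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-3B-Instruct"            # one example open-weight model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

inputs = tok("Summarize this claim document:", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```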
01:23:10
Speaker
Okay. So these would be a couple of billion parameters, these open-source models — like, say, a DeepSeek? Yeah. So even the SLM is, say, a 4-billion model.
01:23:22
Speaker
But a 4-billion model — 4 billion parameters, and float32 is 4 bytes — that's 16 gigabytes of just weights. Then you have these so-called activations. So you still need an 80-100 GB kind of GPU to be able to run these so-called small models.
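His back-of-the-envelope memory math, spelled out:

```python
# The back-of-the-envelope memory math from above.
params = 4e9                       # a "small" 4-billion-parameter model

for name, bytes_per_param in [("float32", 4), ("float16/bf16", 2), ("int8", 1)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name:>12}: {gb:5.0f} GB of weights alone")

# float32: 16 GB of weights alone -> add activations, KV cache, and framework
# overhead, and you are quickly in 80-100 GB GPU territory at full precision
# (half-precision or int8 shrinks this considerably).
```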
01:23:46
Speaker
Okay, okay, okay, okay, got it.

Custom AI Models: Tailoring Solutions

01:23:49
Speaker
So essentially you are building custom models for companies — taking an open-source model and training it for their specific use case. And do you strip away from it the stuff which is not needed? Is that also part of it? Or do you essentially just train it on what is needed and then deploy it?
01:24:08
Speaker
Stripping away is slightly harder, so no, we do not. We just fine-tune it — we train it for the task at hand.
01:24:20
Speaker
So it learns to do this task much better — better than what it started with, definitely, but also better than what even ChatGPT would do right out of the box. And you have no way to train ChatGPT anyway. So in the end it ends up doing this task better than ChatGPT itself.
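One common recipe for this kind of task-specific fine-tuning of an open-weight SLM is LoRA adapters via the peft library. The sketch below is under that assumption — an illustration of the general technique, not FastCode's actual pipeline.

```python
# Sketch of LoRA fine-tuning: train a few million adapter parameters
# instead of all ~3-4 billion base weights. Illustrative settings only.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
lora = LoraConfig(
    r=16, lora_alpha=32,                       # adapter rank / scaling
    target_modules=["q_proj", "v_proj"],       # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # typically well under 1% of the weights

# ...then train on the company's task data with a standard Trainer loop;
# the frozen base keeps its general knowledge, the adapters learn the task.
```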
01:24:40
Speaker
Okay, understood. One limitation I have heard about foundation models is that they are not learning models, as opposed to human intelligence. A human being is constantly learning: if you make a mistake, you learn, and a minute later you know more than you knew a minute back because of that mistake. Whereas a ChatGPT only learns when, say, 4.5 is upgraded to 5 — all that learning is only implemented once in six months or once a year.
01:25:18
Speaker
Is it possible to have that kind of true intelligence, where it is constantly learning on the go? Is that technically feasible? You can think of this setup that we have with these agents as true learning, because it learns from every mistake, and we're not waiting forever to collect the data. It continuously learns and evolves based on whatever signal there is. Sort of — it's still batched, but it is more real-time than a cadence of, say, a few months.
01:25:54
Speaker
Okay, like once a day or once a week? Yeah, we do it nightly. Every night — fascinating. So this is closer to what natural intelligence is like, in that it's constantly evolving. But my question is: is this feasible for a foundation model to do?
01:26:16
Speaker
And if not, why is it not feasible? Is it that it'll need a lot of compute, or what is the reason foundation models aren't doing this real-time learning? One, you need signal, right? Where is the signal? You got some output — how do you know what it should have been? From the users — the users would be giving signal, right? If I'm asking a model to generate a LinkedIn post, or generate an image, and I say, okay, I don't like this, do this, do that — all of those changes. So that would be signal. Yeah, but then you're believing the user. What is more direct is — you might have seen in ChatGPT, it sometimes gives you two responses and asks you to tick one of them.
01:26:57
Speaker
That is direct data we are creating for OpenAI, so it'll definitely take that. And then, yeah, it's really that huge model that you have to train on that huge cluster, and you first need to accumulate enough data to be able to do that. Our models are smaller, so we can run it nightly. But theoretically, if you have a continuous signal, you can continuously train it.
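That "two responses, tick one" interaction produces preference pairs, the raw material for RLHF/DPO-style training. A toy sketch of the data shape and of a Bradley-Terry-style loss on model scores (names and numbers invented):

```python
# Toy sketch: a preference pair plus the loss that rewards the model for
# ranking the human-preferred response above the rejected one.
import math

preference_pair = {
    "prompt":   "Draft a reply to vendor X about late delivery.",
    "chosen":   "Dear X, per clause 4.2 we require...",   # user ticked this
    "rejected": "hey, where's our stuff??",               # user passed on this
}

def preference_loss(score_chosen, score_rejected):
    """-log sigmoid(chosen - rejected): small when the model already ranks
    the human-preferred answer higher."""
    return -math.log(1 / (1 + math.exp(-(score_chosen - score_rejected))))

print(preference_loss(2.0, -1.0))   # ~0.05: model agrees with the human
print(preference_loss(-1.0, 2.0))   # ~3.05: gradient pushes model to flip
```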
01:27:24
Speaker
But there is a resource constraint for a foundation model to do real-time learning — it would be extremely resource-intensive. And there are also so many tests and bounds you have to perform on a model once it is trained. You cannot just unleash it on the world — what if it learned something wrong? Okay, so there's a safety issue. Manual tests, red-teaming, trying to break it, teams of experts, all of that.
01:27:55
Speaker
But isn't that the path to AGI? Wouldn't real-time learning be what you would call AGI? I would say even that is not enough, because an LLM, at the end of the day, is a stateless model — stateless meaning it doesn't remember anything.
01:28:12
Speaker
All it knows is its context. If it has memory, that is again something external that it can tap into; it doesn't really remember things natively — it is not wired in. So it is stateless, and we probably need something more than that. And it has no knowledge of the world, no knowledge of the rules of physics — it also needs knowledge of the world in some sense. So it has a lot of limitations. Also, it cannot do few-shot learning like we humans do, or one-shot. Yeah, sorry —
01:28:45
Speaker
You said few-shot? I didn't understand. Few-shot and one-shot — what I mean by that is that we as humans can learn from very few examples. You show a child a weird animal and you say, this is a rhinoceros, and probably it'll remember that this thing which has a horn on its nose is a rhinoceros. But our modern machine learning models cannot do that yet.
01:29:10
Speaker
Okay, they need like 10,000 photos of a rhinoceros before they can recognize one. Yeah, at least hundreds or thousands of different ones — and even then it might only learn it for that one pose. A child, you show it from the top, from the side, from the front, and it'll still know it's the same object. Our models cannot do that yet.
01:29:33
Speaker
Why is that? This is so fascinating — how advanced human intelligence is. Before these models came about, people didn't really appreciate how extraordinary our intelligence is.

AI Understanding: Language and Beyond

01:29:49
Speaker
And not just human — even cat intelligence. A house cat — Yann says that a house cat is more intelligent than the LLMs. If we knew why humans are like that, we would try to encode it in our models. Which is why you keep saying it has to know physics — I wasn't really able to relate to why it needs to know physics, but possibly that's one of the reasons? Yeah. Knowing physics is basically knowing how the universe works. If it knows how the universe works and you ask it a certain question — you say, I dropped the ball — it'll know, okay, the ball went to the ground, and it can behave based on that. Whereas if you just base it on training data, and in the training data the ball always went up, then it will just say the ball went up.
01:30:47
Speaker
Does a model know language? Emphasis on the word know — does it know language? Again, you have to define what it means to know, but language models are currently the best models out there to define language.
01:31:05
Speaker
So they definitely know language in some sense. They do much better than everything else — the different grammars, the Chomsky hierarchies, all of that. Language models are still by far the best models of language. It's just that they are not interpretable; it's not easy to understand what is going on inside the model. But yes, they are definitely language models.
01:31:31
Speaker
So just by reading about physics, it wouldn't know physics — you're saying it has to get the physical experience of it before it knows? I like what Yann says: it would be a stochastic parrot.
01:31:45
Speaker
It'll say similar things, but it will not really know it. It will be able to give that as an output, but it will truly not know it, and it will not be able to transfer the knowledge. If you tell it that a ball falls, it will only know that a ball falls; it will not know that, okay, a glass will also fall.
01:32:09
Speaker
So it will not be able to generalize and distill that knowledge. Okay, understood. So, coming back to what you're doing at FastCode — what are some use cases you have worked on with these platforms, the reinforcement learning and data platforms?
01:32:27
Speaker
Yeah,

AI Applications: Real-World Implementations

01:32:28
Speaker
so for reinforcement learning, this procurement product is one of our leading, flagship products. We do procurement transformation for various enterprises, especially procurement-heavy enterprises — EPC companies are one example, national oil companies another, which is why we are focusing a lot on the MENA region.
01:32:54
Speaker
So this is one of our products. Like replacing human effort — say, where a human being might be evaluating bids, negotiating with vendors, or asking for clarifications — all of that effort is now being done by agents?
01:33:12
Speaker
Yes. And for example, in insurance — this is where we trained another agent — there's a real estate company, and this insurance team belongs to that real estate company. The company is in the business of leasing real estate, and they lease to different types of customers. A customer comes in wanting to lease this building, this floor, or this office space, and the insurance team is supposed to come up with the different insurance clauses that need to go into the contract. They have data for the last many decades, right? So how do humans do it? They look at prior data, and they have rules of thumb: are you an F&B outlet? If you are, you probably have a kitchen; if you have a kitchen, you have fire risk; if you have fire risk, you have to have this kind of insurance clause in your contract.
01:34:15
Speaker
And then: oh, this is what we did in the past. So we created an environment and trained a reinforcement-learning-based agent for that.
01:34:29
Speaker
Okay, got it. You could have sold this as SaaS — you could have, let's say, a procurement solution competing with SAP. You could say this is the AI-native version of SAP or whatever. Why go down this route of a more services-like approach rather than a SaaS approach?
01:34:52
Speaker
Well, we do have a platform. This reinforcement learning pipeline is a platform, and the nuts and bolts to build the environments — because we can build these environments very quickly — are a platform. But we need to go the Palantir way, right? Every company has a slightly different way of doing procurement: oh, I'm not using Salesforce, I'm using Jack Henry; I'm not using PostgreSQL, I'm using Milvus. We have these different connectors, but if you expect a company to do this themselves, they might not be able to do it right. So they take our toolbox, and our forward-deployed engineers sit within their team and rig it up for them. Then it needs to learn within their environment — it's not that what we have trained for company A will directly work for company B. For this reason we have chosen to go this platform route rather than a pure SaaS route.
01:36:01
Speaker
How do you see enterprise software spend — and by software spend I'm including spend on AI — panning out a couple of years down the line? Will it be largely like this, custom deployments purpose-built for them, or will they be buying off-the-shelf solutions like SaaS? I mean, there will be SaaS for certain smaller outcomes. But what they're looking for is ROI, right?
01:36:34
Speaker
Return on investment. If you can save them some money — for example, in our procurement use case, they usually negotiate hard and put in a lot of effort when it's
01:36:47
Speaker
a multi-million-dollar procurement contract. But if it's just a $10,000 contract, they don't put too much effort into it. If our agent is able to put effort into that and negotiate a better deal for them, then this is pure ROI they would be willing to pay for. And we do outcome-based pricing: if we save X for you, give us 0.1X, or even 0.05X.
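The fee formula itself is a one-liner; toy numbers below, purely illustrative:

```python
# Outcome-based fee in one line: the vendor's cut is a slice of verified
# savings, so the client only ever pays out of money actually found.
def outcome_fee(savings, share=0.10):        # "save X, give us 0.1X"
    return savings * share

baseline_spend, negotiated_spend = 1_000_000, 880_000   # toy numbers
savings = baseline_spend - negotiated_spend
print(outcome_fee(savings))   # 12000.0 on 120000 saved
```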
01:37:18
Speaker
So I think companies in their transformation journey, all they want is to reach this ROI, and whatever gets them that ROI, they will be happy to do. And SaaS is about solving one problem properly, right? We might be using SaaS for certain things: we might want authentication from one provider — we don't want to write our own authentication, so we use APIs for that. We might use a database from someone else, cloud storage from someone else. All of these are essentially SaaS, but we are then using this SaaS, building on top of it, and creating value.
01:38:01
Speaker
Okay. The salary TAM — I think that's the term a lot of people in AI are using — the idea that the total addressable market is the salary being paid to employees, and that salary TAM is being captured through AI businesses.

Pricing Models in AI: Outcome vs. Value-Based

01:38:18
Speaker
Is outcome-based pricing the norm when you're selling AI to enterprises? No. We like it because the margins are high. You can do time-and-materials; you can do value-based pricing, where you price against a particular value; and then there's outcome-based pricing, which, if you have the confidence, has the maximum risk-to-reward ratio.
01:38:40
Speaker
So an Infosys would not be doing this, right? Outcome-based pricing — I mean, they would also be selling AI to enterprises, but probably... I don't know, I'm not aware. Maybe they're doing it for certain things, because for certain things it's very clear that it's a win for both parties. It's like finding a 500-rupee note and handing it over, saying: I found this, but you need to give me 100 back. It's a no-brainer for the other side as well.
01:39:10
Speaker
Are there any hidden pitfalls of outcome-based pricing that founders need to be aware of if they're considering going down this route?
01:39:20
Speaker
Yeah, you need to be really sure about what you are promising. You may want to do value-based pricing with your first few clients, and then, when you have a well-oiled machine and some data points, move to outcome-based. It depends on your risk-taking ability, and on whether you are bootstrapped or venture-funded. We are fully bootstrapped; if you have venture money to burn, maybe you can go directly to outcome-based pricing. I don't know —
01:39:48
Speaker
it's a personal choice. How do you do value-based pricing? Value-based pricing is essentially pricing against the value we believe we create for the customer. We may not really know the outcome, but we feel, okay, this will save them this much, and then we go and ask for a share of that. The first rung is cost-plus: we spend so many developer hours on this, so we charge that much plus a percentage.
01:40:15
Speaker
But because we bring in so much expertise, and we have this niche with little or no competition for this very particular problem, we can go and do value-based pricing — they have no other option. They ask themselves: okay, are we willing to pay this? So we try to estimate what the customer would be willing to pay based on the perceived value of what we bring to the table.
01:40:41
Speaker
Okay. So value-based pricing is a kind of rule of thumb: if I'm doing this for a company, they would probably save this much; if they're currently spending this, I'd charge that. Based on that, you're guesstimating and giving them a price.
01:40:59
Speaker
That's correct, exactly. And outcome-based is where you have real data — the exact data of what they spent, say, on the canteen: how much they were paying the canteen vendor last year versus this year, on a per-employee basis or something like that, and the amount you saved them. And there is no conflict in this? In terms of the client saying, no, this is the wrong way to look at it, or the like?
01:41:32
Speaker
It depends — certain clients want to do it, certain clients don't. For example, a client wants to buy these large valves, and they have guidelines: as long as it meets these guidelines, we can buy those valves from these different geographies. So we say: okay, we will help you buy that cheaper, faster, and at better quality — are you then willing to share a little bit of what you have saved?
01:42:00
Speaker
So some of them say, yeah, let's do it.

AI in Daily Life: Innovations and Adoption

01:42:03
Speaker
But what if the client feels the price came down because the price of iron came down — these commodity prices? No — in the end it's a contract that we have drafted. We base it on a certain agreed value: we say, we are happy for this year if you can buy us these things at this price. And then whatever we gain on top of that, we take a percentage of.
01:42:33
Speaker
Okay. So do you only take outcome-based pricing, or do you also have something fixed plus outcome-based? It depends. If it's a smaller company, sometimes we just do value-based pricing, and for value-based pricing we have a fixed component and then a yearly component.
01:42:51
Speaker
It really depends — this is why, when you do enterprise sales, it's a very tailor-made solution, and why it's a three-month sales cycle.
01:43:04
Speaker
But with all these procurement things — because we have an agent on this side — I think it'll become very interesting when they also have agents on the other side, when agents start to negotiate with agents. When agents negotiate with agents, it'll probably all come down to the same equilibrium we have today.
01:43:21
Speaker
Yeah, I remember reading about how Bumble and Tinder, these dating apps, are introducing AI so that when you swipe — left, right, whichever side it is — and a match happens, the initial chat is handled by an agent. That date happening through two agents — I found that something.
01:43:41
Speaker
I see. Yeah, I watched this movie recently — Swipe Right or something — about the story of Tinder and how this woman then went on to create Bumble. Right, right. Interesting. Okay. So: if you're not charging a company a fixed price, how do you ensure they have skin in the game? And how do you see the attitude of enterprises towards AI in terms of adoption — is there resistance, or is it very easy for them to adopt?
01:44:15
Speaker
Yeah, that's a very good question. The resistance is from the people, because we as humans do not like change — even if it's helpful, we're comfortable doing what we're doing. So it's definitely tedious to go through that change. But if you have certain wins in an adjacent space, or with their competitors, then a FOMO comes into play and they're more willing to join that journey with us.
01:44:50
Speaker
Okay. I imagine there was a similar transition when automation meant spreadsheets — everything was done on spreadsheets, and then you got specialized SaaS tools. Say, for employee expense reimbursement: you don't have to do it on spreadsheets anymore, there's a specialized tool for it, but it comes with the caveat that the human beings using the tool have to upload their bills and so on. So there is that change management, or behavior management, which is needed. Yeah, and also learning to use the tool. Yes — learning to use the tool, how to apply for a reimbursement and so on. Is there a similar challenge here? Because I feel like
01:45:37
Speaker
when you're replacing humans, you don't necessarily have that change management challenge, because you're just going to tell somebody: instead of processing 100 invoices, looking at 20 parameters for each invoice, you now only have to do the 10 invoices flagged by the machine as needing a human, and in those 10 you only have to look at five parameters. Yeah, so
01:46:08
Speaker
I think a lot of success depends on how little friction there is between the human and this new way of doing things, and we try to reduce that friction as much as possible. If earlier they were just using email and now we want them to use this platform, there is still that one bit of friction coming in. But yes, the friction is still less than starting to use a completely new tool, because you're maybe just reviewing an email draft. Although, for example, they were never used to looking at email drafts to vendors for contracts under $10,000, and now they're having to look at those. That is a new paradigm for them, and that is where the friction comes in.
01:46:50
Speaker
But that's a transitory phase, right? Because why would someone even need to go to a platform? The AI agent would eventually be sitting inside Gmail itself, and once the trust is there, why do you need to see the draft? The mail would go out on its own.
01:47:09
Speaker
I feel this would be an easier adoption path, because it's not asking people to change how they behave. Yeah, you're not asking them to change how they behave — you're asking them not to do anything at all. Yeah, that was kind of the undertone. So I really worry about the future in

The Future of Work: AI and Human Roles

01:47:31
Speaker
that sense, right? Roles where we are not adding much value, where we are not doing something very creative. A lot of
01:47:43
Speaker
redundancy has already happened due to AI coming in, and when these agents become mainstream, that is when a lot of it is going to happen in a big way — they're really going to do all these mundane things. Currently we use a human to stamp an approval — okay, send, send, send — but when you get a thousand approvals in a row, you know you don't need that stamping monkey anymore. Are current LLMs good enough to not need human supervision for, let's say, 90% of the work?
01:48:21
Speaker
Sometimes not to start with, but once they undergo that reinforcement learning process, yes, I think they are usually at 90% or better. But you still need humans for that 10%, right? Because failure is not an option — you might end up paying. What if that one mistake adds five zeros? So you still need these different guardrails — we do have guardrails built in, especially around numbers and money. But I was just giving you an example.
01:48:52
Speaker
Okay, understood. So let me end with this: considering the current state of AI models and of the technology, what do you think are the opportunities today — the white spaces where somebody who wants to build something could build companies?
01:49:15
Speaker
Yeah. So I think the white spaces are where there is friction. If the friction is in something very shiny and interesting, there are a lot of people looking at it; but if it's really some boring back-office work, I think there's a lot of opportunity in reducing that friction. And I think we should start from empathy: look at what the customer wants, first understand them, and then try to build it — instead of just thinking it up in our heads and starting to build. Have that compassion, that empathy; look at where people are suffering, go and talk to all of them, then come back and build something very quickly, iterate quickly, take it to them, give it to them, reduce their suffering if you can, and if not, iterate. And in my opinion,
01:50:11
Speaker
starting is the most important thing. Once you take the first four steps, you get to see what's in the next four. No idea comes out fully formed — you will pivot many times, your idea will evolve, and it becomes real after a while.
01:50:32
Speaker
Thank you so much for your time, Arjun. It was a real pleasure. Yeah, thank you. It was lovely. Thanks. Bye.