Become a Creator today!Start creating today - Share your story with the world!
Start for free
00:00:00
00:00:01
Why future applications of AI will need higher quality data image

Why future applications of AI will need higher quality data

S4 E24 · Bare Knuckles and Brass Tacks
Avatar
118 Plays18 days ago

What if the real AI revolution isn't about better models—but about unlocking the data we've been sitting on?

Mike McLaughlin—cybersecurity and data privacy attorney, former US Cyber Command—joins us to discuss something most people miss in the AI conversation: we're building the infrastructure for a completely new asset class.

The conversation moves past today's headlines and LLM limitations into what becomes possible when we solve the data access problem:

Research acceleration at unprecedented scale. Imagine biotech startups accessing decades of pharmaceutical failure data, every null result, every experiment that didn't work. That's years cut from development cycles. That's drugs to market faster. That's lives saved.

Universities as innovation accelerators. Right now, research institutions pay to store petabytes of data collecting dust on servers. Mike argues they're sitting on billions in untapped assets to fuel innovation.

Beyond synthetic training. The next generation of AI won't be trained on Reddit threads and scraped websites. It'll be trained on high-quality, provenance-verified research data from institutions that have incentive to participate in the ecosystem.

Mike's vision isn't just about compliance or risk mitigation. It's about creating the conditions for AI to actually deliver on the promise everyone keeps talking about. The compute exists. The capital exists. The models are improving. What we need now is the mechanism to turn decades of institutional research into fuel for the next wave of moonshot innovation.

Mentioned

Google licensing deal with Reddit

Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples

MIT researchers discover new class of antibiotics using machine learning

Reducing bacterial infections from hospital catheters using machine learning

Recommended
Transcript

Rise of Data Centers and AI Challenges

00:00:00
Speaker
I think if there is a market correction, it's not going to be for a lack of compute. We've got billion dollar data centers going up in just about every state, including ones that make no sense like Arizona.
00:00:11
Speaker
Oh, by the way, AI data centers create an incredible amount of heat and they need to be cooled down. If you put it in a desert, that's problematic. But we do it in like Kentucky and Minnesota and Nebraska and you see wheat fields and corn fields all being transformed into these huge AI data centers. It's not going to be a compute issue. It's not going to be a storage issue. It's not going to be a tech issue. It's going to be a data issue.
00:00:33
Speaker
And so if we do see a market correction, it's going to be because synthetic data just doesn't get us where we need to be. Reddit threads aren't going to get us where we need to be. Scraping open source and readily available data is not going to get us where we need to be. Unless we unlock access to good, high quality data from research institutions, universities, and corporate holdings and give them an incentive to monetize that data, that's where we're going to see the market correction.

Introduction of Hosts and Guest

00:01:06
Speaker
Hey, welcome back to Bare Knuckles and Brass Tacks. This is the tech podcast about humans. I'm George K. And I'm George A. And today our guest is Mike McLaughlin.
00:01:18
Speaker
I don't know what to say about Mike. I've known him for a number of years, came up through cyber in the hardest core way possible. I met him when he was at U.S. Cyber Command. He has since gone on to be a policy attorney for cybersecurity and privacy. But today we wanted to talk to him about something different, about data as a protected asset class and quality data for AI development.
00:01:41
Speaker
This is a great conversation because it's one that not enough people are having. Yeah.

Mike McLaughlin's Expertise in Cybersecurity

00:01:47
Speaker
I mean, the you know same kind of background with Mike. Um, I think I'm, I might've met Mike around the time we started our show and, uh, we discovered that we have a very, very similar working background, military background. And then, uh,
00:02:00
Speaker
You know, it's it's kind of interesting because he is probably, I would say, easily one of the top five ah cyber, if not technology lawyers in America today. And that's not an exaggeration.
00:02:12
Speaker
um But he is extremely down to earth, very knowledgeable. And i think he... Unlike a lot of people who have jobs like his, who live in an ivory tower and don't really understand, you know, real world implications, the problems that they're discussing. Mike very much understands what are the day-to-day impacts, what are the commercialized impacts and how this really affects everyone's everyday lives. So the perspective that Mike brings is, i mean, I can't believe we got to have him on the show for free because people pay him a lot of money for this kind of thing. Those are the people we roll with. People who do us solid favors.
00:02:49
Speaker
So ah yeah, Mike jumped on at the last second. Yeah. um So for the record, Mike and I are working on a project that's currently in stealth at the moment.
00:03:00
Speaker
And our combined knowledge on this specific use case kind of has a lot to do with what we're building. And so if you're following Mike, which I encourage you to do so on LinkedIn, we get into a lot of the stuff that he's been recently talking about, how we value data, how we make it legal and licensable, especially to unlock the next era of

Importance of High-Quality Data for AI

00:03:21
Speaker
AI development. So we're not running into this, I scraped all the available things without anyone's permission. mission sort of thing, but also what kind of data is really needed to drive the frontier, not just LLMs. I repeat, AI is not just LLMs, but advances in materials science, molecular biology, ah renewable energy, all the stuff that we are really excited for and might give us the Jetsons future that we were promised. But without further ado, let's turn it over to Mike McLaughlin.
00:03:57
Speaker
Mike McLaughlin, welcome to the show. Hey, thanks, George. Thanks for having me. Yeah, we are excited to have you here. um I've known you, well I think we just calculated this since 2018. That was on the cyber side of things. You have since moved on from U.S. Cyber Command into law practice.
00:04:19
Speaker
I remember you doing night school, which was insane, and then on to policy and now some new ideas. So that's the reason we're talking today. So why don't we start there? You know, we'll move outside of cyber, but I'll give you some space to describe sort of what's pinging around in your brain today as it relates to data and AI. And we'll take the conversation from there.
00:04:43
Speaker
Yeah, absolutely. And so I think the first thing whenever I talk about cyber with people, and I always tell them, I'm a cybersecurity and data privacy attorney. they look at the the two things and they think they're mutually exclusive and inherently they're not. um Any issue that you have when it comes to cyber also touches on data.
00:05:01
Speaker
um And more and more what we're seeing is it's not just cyber, it's not just security issues, but it's really all of the AI development that has exploded over the past two to three years.

Legal Risks in AI Data Usage

00:05:10
Speaker
And that continues to come pace with more and more companies adopting either GPTs, LLMs or building out their own models.
00:05:17
Speaker
What we're seeing is that there is a thirst or a hunger for data But that thirst and hunger bears with it a whole lot of legal liability and uncertainty that is either going to crater a lot of these companies that are trying to adopt these AI models without a solid understanding of it, or it's going to increase their risk of litigation, both of which are really, really bad for our ecosystem and don't help to foster innovation. And that's something that we're trying to resolve right now And when you say liability, or we're just, for our listeners' sake, we're talking in the realm of this idea of what is...
00:05:51
Speaker
open source versus what is freely accessible, quote unquote, like all the euphemisms that were justified to feed the internet into l lms
00:06:05
Speaker
I would actually have been in a couple of different ways. The way I like to think about this is you've got your you've got your intellectual property liability, and that's going to be, do you actually have the rights to use the data that you're using to train your models? This is going to be primarily under the Copyright Act. And so think, recently there was a lawsuit that settled between, it was called Barts v. Anthropik. And this was, Anthropik went and scraped Encyclopedia Britannica and took a whole lot of data, didn't used it to train Claude.
00:06:31
Speaker
um The problem with that was anybody who wrote for Encyclopedia Britannica or anybody whose data was or whose information, whose intellectual property was being used as part of Encyclopedia Britannica, that was all licensed.
00:06:44
Speaker
Well, Anthropic, when they went and scraped it, they didn't get that license. They basically just pirated the data and plugged it into their model, trained it, and then had that be the output or had that be the what the model was based off of. And that's something you can't do because that infringes upon copyright and infringes upon intellectual property. Those original creators of that that information, of that data, of that intellectual property, they need to be compensated.
00:07:07
Speaker
And so if they're not, that falls under the Copyright Act in the US, at least. And there are significant penalties. um It can go from $750 per infringing event to up to $150,000 you can show that it's willful. In the case a lot of these AI developers, it's hard to say that it's anything but willful because there

Cybersecurity Threats to AI Training Data

00:07:25
Speaker
is no guardrails on it. they're just collecting data.
00:07:27
Speaker
Yeah, and I believe that settlement was to the tune of like $1.8 billion. So it sounds pretty willful to me. It was. And Anthropoc actually got off easy because the upper bounds of what that judgment could have been was about $72 billion. dollars So we talked about like existential threat to companies if they don't understand the fire that they're playing with, it's pretty significant. but that's the first bin is the intellectual property side. But it's not just IP. It's not just that liability. It's also when you start to think about the security aspect of it. yeah
00:07:59
Speaker
I have been saying this for probably so probably a year plus at this point, and people are just like, all right, you're the tinfoil hat guy. I truly think the next wave of cybersecurity incidents is going to target AI. and It's going to target AI training data. You think what we are seeing right now when it comes to ransomware and targeting and encrypting files and making it inaccessible to companies and then they have to pay the ransomware, they have to completely rebuild.
00:08:21
Speaker
Imagine if you've got ransomware actor who gets into your training data and just inserts poison data in there. such that you have no idea what your model is actually going to output. And if you've got an AI tool that's used in operational technology, or if you've got an AI tool that's used to make very significant decisions in healthcare, for instance, and you can't trust the training data, that's a huge problem. And that's not a problem you can easily fix simply by rebuilding your infrastructure. You basically have to completely rebuild the model.
00:08:49
Speaker
But I think that's where we're going to start to see cybercrime go. If you don't trust the provenance of your data, if you can't prove the provenance of your data, you're going to be in a very, very bad way. This is essentially the the where we found ourselves in cybersecurity when most companies moved to backups and started backing everything up every one to two weeks so that they can overcome this problem. This is something that nobody's thinking about right And this is going to significant issue forward. point, I think it's been shown that it doesn't actually take a lot to throw off the model because you're dealing with statistical ah derivations. And so if you you, you could have gigabytes of data, but it only takes a small percentage deviation to throw everything else off, right Because it cascades through the calculation.
00:09:33
Speaker
So Cornell University in October published a report on this very topic and they looked at how many documents, how many individual documents would it take to poison a large language model of any size. The number they came up with was 250. You can have a billion token model.
00:09:50
Speaker
And it takes 250 documents to destroy it and to poison it completely. And if you think about how many that is, 250 pages in an Encyclopedia Britannica volume, it's nothing.

Data as a Valuable Asset Class

00:10:01
Speaker
But that's what we're talking about. And so it's very small numbers that can have this huge cascading outside effect. Yeah, i think i think kind of the big problem too is, you know, as we kind of go through and and talk about how we solve a lot of these issues,
00:10:16
Speaker
I think a bigger issue is that individual client-side organizations, and Mikey, you know, you represent many of them, have no um visibility or knowledge of who actually owns their own data, who actually has rights to that data.
00:10:32
Speaker
and And as we're talking about with the issue in the lawsuits, right? it think it was one of the first times that that data was actually validated as something that was defensible in a civil litigation, right? And I think that's kind of like where we also see a lot of commercialization opportunity, right? So I think where it's difficult is trying to find a balance between a secure solution where data can be transacted or exchanged in a secure manner, but then the actual commercialization of the opportunity. Because we've been hearing as well for almost 20 years now, data a new oil, data is new oil.
00:11:06
Speaker
Well, is now the era where we now start to mine and drill that oil and actually. Or we have the like refining capacities to extend the metaphor. The the use cases. I mean, are the use cases also then becoming a lot more diversified as now people are understanding that the trade and exchange of data is going to become its own market going forward.
00:11:32
Speaker
What are your thoughts on that? Well, I think that's I think it's an inevitability. And what you're talking about, George, is data as a separate asset class. If you put that in the context of cybersecurity and you look at the cybersecurity insurance market, for instance, cyber insurance, they are pretty good at pricing out risk when it comes to traditional things. Right. Look at health insurance.
00:11:55
Speaker
You have to go to a doctor, you look quite literally get an enema so that you can determine what is your risk of, what's the insurance carrier's risk of loss when you're getting life insurance? Do you smoke? Are you 300 pounds? Do you have you know a bad heart? do you have a history of genetic disorders? Based on all of those factors, they they tell you what your premiums are going to be for a certain amount of life insurance.
00:12:18
Speaker
But when it comes to cybersecurity, they don't do that. They simply say, what industry are you in and what cybersecurity controls, what framework are you aligning to? Well, that completely negates the data element of risk.
00:12:31
Speaker
And so if you look at a healthcare organization, for instance, a pediatrician's office that's just dealing with kids day in and day out with sniffles and pink eye and whatever else is going to have a very different risk profile for an insurance company than an AIDS clinic or than an abortion clinic.
00:12:47
Speaker
Because if you think about the liability from litigation that's going to follow one of those, there is a significant risk of harm to those individuals whose protected health information could be compromised in the latter two instances versus in the former. Because the pediatrician's office, it's just inherently they're not handling that type of data.
00:13:05
Speaker
The problem that insurance carriers have is they don't have a way to value that data. What is the difference between the value of data at a pediatrician's office versus an AIDS clinic? And there's no way that anyone is looking at that right now. So insurance carriers are basically throwing their hands up and saying, a healthcare care provider is a healthcare provider is a healthcare care provider. And they're they're pricing that in the same way.
00:13:25
Speaker
Right now, because we don't have that mechanism to actually properly evaluate data, It's a huge risk to insurance carriers and a risk of catastrophic loss should there be some sort of cascading or some sort of significant cybersecurity incident that's affecting an entire industry or that's affecting an entire geographic region.
00:13:42
Speaker
And that's just one area. We're talking about asset classes, though, but that asset class of data, being able to evaluate that, really important for insurance. well it's also really important if you're in mergers and acquisitions.
00:13:53
Speaker
If you're a company and you're trying to acquire another company, you typically look and you say, well, what are your, you know, how many, what hardware do you have? Your your computers, your servers, your racks. You look at the vehicles that are in their fleet. You look at the building that they are leasing. You look at all of these different types of assets when you are factoring in what is the value of this company?
00:14:13
Speaker
What about the data that they're holding? Nobody's actually looking at that saying, well, the data that's in your stores, what is that worth? What's it worth to you as a company? What's it worth to me as a buyer? What's it worth to an AI developer? Should we want to license this data? Hold on. if If they even know the data they have. they know Right. Right.
00:14:29
Speaker
and right right And I think to ah the defender's point of view, being able to value that data would also give them more grist for going to get budget to protect it, right? Because then the company understands the monetary value to that risk rather than like vague operational risk, downtime, fudge the numbers here, it kind of costs this much, to you know, versus like if we lose this petabyte of data, it is worth X, Y, and Z, right?
00:15:07
Speaker
yeah um I want to go back to what you'd said about American innovation. I'm sensitive to that because I remember we first got acquainted with influence operations and that was affecting an electoral process. Also the most American of institutions ah that is sacred to us.
00:15:26
Speaker
But um yeah, I want to pull that thread a little bit about instead of just being the tinfoil hat guy, How do you see building a system that understands how to measure data or how to license it properly?
00:15:45
Speaker
Can you talk a little bit about that, not just from the sort of litigious side, but like, how does this spur, how does this get us forward? Well, so ah let's let's back up a little bit and talk about the big picture of where we sit right now when it comes to

AI Data's National Security Implications

00:16:02
Speaker
AI. And because really, when we talk about data, that's that's the 800 pound gorilla in the room is AI, because that's what's sucking up all the data. That's where we need to primarily put quality data. And that's really what's going to lead to innovation going forward.
00:16:16
Speaker
And anybody tells you that it's not is like the people who said, you know, don't really know about this whole internet thing. I'm a bit skeptical. I don't know if I'm going to buy it. People won't buy online. Right. Who's going to do that, right? That's where we are right now. And so when talk about that, we look at from a national security perspective. And being, know, former military, former government, I'm always looking at how do we make sure that we are ensuring our economic viability as the West, our military viability, our national security interests.
00:16:45
Speaker
When you look at China, who is our, i mean, call spade a spade, our primary competitor in the space, from an AI perspective, they have access to basically and bottomless troves of data. You've got 1.4 billion people that the PRC is actively collecting data on, on a daily basis, and they're making that data widely and readily available to their national champions. And so companies like ByteDance, which are huge, huge AI developers, they have access to all, not just the TikTok data, but all the data that's coming off of the surveillance systems throughout China so that they can develop the best facial recognition ai models in the world, period. And those AI models and those tools can then be used for military purposes. And then we, God forbid, we get into a shooting war, are going to be used against us. So how do we make sure that we are enabling or we are fostering an environment where the data that we are making available to American AI developers or to Western AI developers is as good, if not better, than the data that China's getting access to? it And really, that's what it comes down to from a national security perspective.
00:17:51
Speaker
Well, currently, you look at our our universities, our research institutions, By and large, they don't have a mechanism by which they can make all their research data or the data that they have in their stores readily available to smaller AI companies. If you look at all the big AI companies, and these are your Anthropics, your Google Gemini, your OpenAI,
00:18:11
Speaker
the data that they're getting isn't good quality data. It's volume data. They're scraping the internet more broadly. They're trying to get access to open source data. You've got Google entering into data agreements with Reddit for $65 million dollars for all of their data on an annualized basis.
00:18:29
Speaker
that's great, but you're not going to build models that are going to cure cancer on Reddit data. You're not going to build models that are going to help to identify or yeah or distinguish between a you know a naval combatant in a fishing vessel using Reddit thread.
00:18:43
Speaker
It's just not going to happen. So the question becomes then, if we're going to if we're looking at this from a data perspective, how do you get quality data in the hands of ai developers that are really on the cutting edge, that are being very innovative and that have very specific use cases for that data, not just give me an image of a cat sitting on George's head. Although yeah I do use chat GBT for that pretty frequently. not any fear Fair play. that tro israel Yeah.
00:19:10
Speaker
Hey, just a quick word to say thank you for listening and to ask a favor. If you're digging the new direction of the show, which is looking more at human flourishing and the impact of technology more broadly, share it with friends.
00:19:23
Speaker
It really helps the show. We're really trying to grow something here organically. We don't do paid media. We don't do a lot of sponsorships, so we'd appreciate getting the word out and getting it to people who care about the questions that we're tackling, how to keep tech human, and how to make technology work for us instead of the other way around.
00:19:47
Speaker
Thanks so much for your time and attention. And now back to the interview.

Challenges in AI Data Marketplaces

00:19:52
Speaker
Now we get to the quality data aspect of it. And right now there's no way for, there is no Amazon for data. There is no Etsy, there is no eBay for data. There's nothing that a company can go onto or an AI developer can go onto and say, am looking for X because this is what I need for my model.
00:20:09
Speaker
More than that, a lot of AI developers don't necessarily understand how to how to effectively articulate their data requirements. they get up model They get model requirements and they say, I need my model to do this. I need it to distinguish between a naval combatant and a fishing vessel. I need it to be able to parse through this molecular compound so that we can create a cure for cancer.
00:20:30
Speaker
Great. And what data do you need in order to get to that point? Now you need data analysts. You need people who understand both the problem set as well as the different types of data to be able to translate that into data requirements. And then once you have those data requirements, you need to actually go and find that data, figure out who has it.
00:20:47
Speaker
That's a really long tail. And if you're just a small so plucky AI developer, Your AI developers, your actual engineers are the ones who are having to do that entire process. There doesn't exist a marketplace where an AI developer can simply say, I'm building this model. Here are my data requirements. Who's got the data?
00:21:03
Speaker
but But then i have to look at it like this, right? And I see the issue is, first of all, that a lot of people in those client side positions or even in development positions don't actually understand what a proper model data set looks like.
00:21:17
Speaker
I think there is a massive void and gap in education in the market because everyone can hop on, you know, a GPT or something like that. And then, yeah, great. You you can do really cool things with prompts. And then they think they're a prompt engineer. Like I have met so many people who are not in tech who come up to me because they know I work in tech and they're like, oh, hey, like I want to work in AI. And they set up some weird like coaching business.
00:21:42
Speaker
like based on what they've done with an open source model. And I'm just like, okay, so there is clearly an education gap. How and do you think, Mike, do we do we bridge the gap to educate the market to understand what a good model actually looks like, what that data set consists of, and then how do we get them to kind of trust in the process? Because everything that that you're discussing is gonna require something absolutely bleeding edge you know, net new in terms of of a concept for people to actually be able to trust it, to put things that are high value and actually try to create that, that asset business class.
00:22:22
Speaker
So I think it's a couple of things. So the, the education piece is certainly going to be there, but it's, i don't think it's as heavy a lift as you think. And, and when I say that you look at the way in which, and I, I, again, i'm going to bring things back to cybersecurity.
00:22:35
Speaker
In the early 2000s, cybersecurity education and training just was non-existent. It just didn't happen. you you You heard about computer viruses or you you knew that there were phishing emails out there generally, but we didn't have like know before where you would have the cybersecurity training or the phishing simulations that would come in that you'd be able to use or or any others.
00:22:56
Speaker
You didn't have these very bespoke tabletop exercises that you were giving to your your incident response team and your executives and your boards. That has been an evolution over time where that training really came online out of necessity.
00:23:09
Speaker
I think we're going to see that same necessity come up in AI and particularly for AI engineers, for AI developers, for people who are actually working the space and building out these models.
00:23:20
Speaker
And the type of training that we're looking for is going to be distinguishing between quality data, distinguishing between um license data or legal data and that which is not. Identifying or building out the tools are going to identify poison data sets. All of that is going to come as we continue to develop this.
00:23:37
Speaker
But to the point of how do you actually train a workforce such that they are, they're able to do this appropriately. think it's going to be an evolutionary process. um But ultimately, we need to have a we need to have a model that people go to and point to and say, this is, well, yeah, bleeding edge, and net new. It's something that we recognize.
00:23:55
Speaker
And so if you look at you look at Amazon, you look at Etsy, you look at eBay, you look at these big marketplaces that exist online, it's really easy to just point to one of those and say, look, I buy groceries on Amazon and they get delivered to my door. I buy a bike through eBay because I want a pre-owned bike.
00:24:11
Speaker
Something people know. You can take that same model and you can say, I want data. And an AI developer can come in and say, yup this is how I get data. This is how I find it. This how I license it appropriately. This is how I know that I've got all the legal mechanisms and controls in place. This is what I'm going to go do.
00:24:27
Speaker
Then you don't need that significant evolution for training. It's simply adopting a model that people are already familiar with. And I think, well, I'll say two things. The first is, to your point, George, whenever I can, i do try to be that asshole who it tries to set the stage of like, Hey, I just need you to all to remember that AI is not just chat bots, right? Cause there's this, we're in a moment culturally where generative AI, which is a specific type of machine learning is being conflated with AI period. And then to your point, Mike, I think
00:25:09
Speaker
and We're in a chicken or egg situation or a cart before the horse, which is when better quality data becomes available, we will see more innovation that can utilize it. Right now, when you're in a startup and you're like, oh, I want to create something with ai the stuff that you have available is mostly generative models.
00:25:32
Speaker
And so tends towards text-based, image-based features. Generative stuff. But to your, what you'd said previously, like, we're not going to get advanced material science with LLMs drunk on Reddit threads, right?

AI's Potential Beyond Current Models

00:25:49
Speaker
Like, there's just, isn't that there's no there there. And so I have cited these two things, I think, repeatedly on the podcast, and we'll link to them, but You know, when Anima Anankumar at Caltech highlighted this research that they did to learn more about bacterial infections in hospital catheters, they analyzed the fluid dynamics using machine learning, but it was high quality medical data.
00:26:19
Speaker
And then they use 3D printing to make microscopic, ah you know, angles or whatever. It reduced infections insane amounts, like saves hundreds of thousands of lives. But that's not like cool with a K chatbot.
00:26:33
Speaker
And then um again, to your point, like MIT researchers discovered a new class of antibiotics, but it took years. grad students pipetting samples. It wasn't, we're going get that off the internet. So that's the that's an AI that I want and that I'm here for is like the really advanced stuff.
00:26:53
Speaker
well And George, the thing about this, for all of what you're describing, you're talking about things where people were successful in their research that gets published that gets put out and you're finding really good quality data that's public.
00:27:04
Speaker
Think about all the failures. Think about all the research that does not lead to success, that doesn't get reported. You're talking, again, about petabytes of data that doesn't actually get reported.
00:27:14
Speaker
Well, if you run an AI model against failures, guess what you're going to find? Same thing Thomas Edison found, right? 999 ways not to build a light bulb. But you're going to come up with these insights that ah nobody else would would even look at or that nobody else is looking at. We simply simply do not have access to that data right now.
00:27:30
Speaker
To be able to open that up. And yes, to your point, like like these things are coming out of research universities because they're doing the research. So there's not a private enterprise. It's like, I would love to build a company that does this because I don't know where to get the data. They either have to get spun out of universities or, you know. Let's let's go back to like the proper scientific method, right? Like what's the scientific method?
00:27:52
Speaker
The scientific method is you come up with a null hypothesis and you try to prove the null hypothesis in your taste test test, right? This is literally exactly the way science is supposed to be done.
00:28:04
Speaker
So I think, I think there's a big value in taking a lot of this, we'll say failure data, and and that can be packaged up to an organization that's trying to research to build something because it'll save them time. Like, oh yeah we got we had Hunter on the show, right? ah like So just for context, I have a friend who's also trying to to do his own startup. He's built a proprietary vaccine.
00:28:26
Speaker
Something that could really change the game in terms of managing cholesterol. It's really fucking cool. Concept is great. Now, imagine if he had a massive repo of failure data.
00:28:38
Speaker
How much quicker would that make it go to market process? Don't do all of this. Yeah. Right. And again, I think it goes back to the education piece because how we actually conduct these processes has to change. We are still playing in a game like it's 2015 or 2005. And Mike, I'm just, I'm struggling where,
00:29:01
Speaker
How do we evangelize to people that the way that they conduct their business, the very research and and acquisition and operational processes that they do all have to change because I feel that organizations are too stuck in the mud in their old ways and they simply can't get around it.
00:29:20
Speaker
Yeah. Yeah.

Incentives for Data Sharing and Market Correction

00:29:22
Speaker
Well, and I think there are two sides to this. Because if if you look at it, one is the essentially supply and demand, right? The principles of economics here, the demand side is always going to be on the AI developers, those who are looking to ingest good quality data to ah be be able to identify that data and get access sex to it very quickly. But then there also has to be something on the sell side. There has to be an incentive for research institutions, for universities, for commercial entities to want to sell their data. And that's where we we talked about the asset class earlier.
00:29:51
Speaker
how do we get universities to identify and to look at their data, the research data that they're simply storing now that's costing them money in in cloud infrastructure and on-prem servers, electricity? How do you get them to look at that data as an asset and to monetize it and then have a vehicle to monetize it And I think this is the change that we need to see happen because once you've got large quantities of data that are made available,
00:30:14
Speaker
for searching, for filtering that AI developers can readily look for, can readily put their requirements against, or identify new opportunities simply based on the data that they're looking at, then you have a model where the the demand side and the supply side are meeting. That's where your price is going to come in. That's where you get that asset class. That's where you get that valuation of the data that can then be used more much more broadly across multiple industries.
00:30:39
Speaker
But I think that's the change that we need to see. I agree with that. like that's That's the problem, I think, is is As we kind of begin the education piece, I think folks who have have vision on the on the problem and I think vision on where the future is going, I think it's our responsibility to kind of educate the market as much as we can. And and i I hate it because, you know, i I hate stupid people and it's frustrating to talk to them. And I hate people that are obtuse. But at the end of the day, you talk to me every day. You know I'm an asshole.
00:31:13
Speaker
But I'm just saying that we have to, you know, put our, our great school educator hats on. And I think, you know, my, my, The thing I spend a lot of time thinking about is how do we convey at scale to the market?
00:31:29
Speaker
Especially, you know, we we just saw the headline today, XAI just got, what was it, like another $20 billion dollars a Series E round, right? just to build compute Just to build a few more offerings. So people are throwing money at this, but I don't think they understand what exactly they're putting money into.
00:31:48
Speaker
And I think, like I'm not saying there's going to be a bubble, but I'm saying, is the fact that there's going to be a market correction, which I believe is more appropriately what's going to happen, is that going to impact the ability to commercialize data in the way that we predict?
00:32:05
Speaker
So I think... if there is a market correction it's not going to be a result of a lack of demand for these types of models i think that train is very much left the station we're seeing ai being implemented in everything i mean if you go on google you're not actually doing a google search you're plugging something into gemini and you're getting response or a summary of the search results first and foremost Their search engine optimization has completely changed since the advent of generative AI. And that's great because it gives better quality answers.
00:32:38
Speaker
But then you look across the board. I mean, you go to your, you go to your, you know, a kid's pediatrician or you go to your physician and you want to just go to the website to make an appointment. Suddenly you're talking to an agentic AI chat bot that's trying to give you medical advice. I mean, that's a huge change. Those types of things are going to continue. And so we are going to see champions emerge. We're going to see better quality AI. We're going to see that a competitive market start to weed out some of the smaller ones.
00:33:02
Speaker
I think if there is a market correction, it's not going to be for a lack of compute. We've got billion dollar data centers going up in just about every state, including ones that make no sense, like Arizona.
00:33:13
Speaker
Oh, by the way, ai data centers create an incredible amount of heat and they need to be cooled down. If you put it in a desert, that's problematic. But we do it in like Kentucky and Minnesota and Nebraska and you see wheat fields and corn fields all being transformed into these huge AI data centers. It's not going to be a compute issue. It's not going to be a storage issue. It's not going to be a tech issue. It's going to be a data issue.
00:33:36
Speaker
And so if we do see a market correction, it's going to be because synthetic data just doesn't get us where we need to be. Reddit threads aren't going to get us where we need to be. Scraping open source and readily available data is not going to get us where we need to be. Unless we unlock access to good, high quality data from research institutions, universities, and corporate holdings and give them an incentive to monetize that data, that's where we're going to see the market correction.
00:34:02
Speaker
Yeah, well, that seems a perfect place to end. I want to thank you for jumping on last minute, Mike. I know we extended the invite pretty late, but always appreciate the conversation. And I think this is an important one because, again, this paradigm that we have now, we should not take for granted is the paradigm of the future, right? Large scale LLM, build all the things, suck up all the available information on the open Internet.
00:34:30
Speaker
is but the first salvo. I think the future is very much in and higher quality data because it's the only way we can kind of move the needle and through the wall, I guess, just to mix metaphors.
00:34:44
Speaker
Absolutely. Cheers, brother. All right. We'll talk to you soon. Thanks, guys.
00:34:52
Speaker
If you like this conversation, share it with friends and subscribe wherever you get your podcasts for a weekly ballistic payload of snark, insights, and laughs. New episodes of Bare Knuckles and Brass Tacks drop every Monday.
00:35:05
Speaker
If you're already subscribed, thank you for your support and your swagger. Please consider leaving a rating or a review. It helps others find the show. We'll catch you next week, but until then, stay real.
00:35:20
Speaker
because all know each other. Perfect. Just one thing. Hard K on my side my name. McLaughlin. McLaughlin? Like Sarah McLaughlin? Just don't make that comparison. But I've never said, haven't had to say your last name in like five years, I think. That's why.