Revolutionizing Business with AI
00:00:00
Speaker
On Developer Voices this week, we're talking about machine learning and from a novel angle, I think, because there's plenty of talk out there about what AI can or could do for you. Is it going to revolutionize your business? Is it going to take over the planet? Is it just going to crash your car? But for all that talk, there's not actually that much said about how we get from an exciting idea to practical production grade software.
Introducing Adi Polak's Book on Scaling ML
00:00:29
Speaker
Well, thankfully, this week's guest has been thinking about it a lot over the years. It's the fabulous Adi Polak. In the past, she's been found assembling production ML pipelines at IBM and Akamai, and she's just released a new book about it with O'Reilly called Scaling Machine Learning with Spark.
00:00:49
Speaker
Her book takes in the whole pipeline from enabling data scientists to build their models, to how you might deploy the models, to monitoring them and setting up feedback loops so they improve and all that stuff.
00:01:02
Speaker
And I think it's interesting, not just
AI's Impact on Business Roles
00:01:05
Speaker
because AI itself is interesting and the current industry hot topic, but because there's a good chance that the practical problems are going to cross your desk at some point in the next few years, whether you're the person instigating AI in your company, or you're just the person being told to support it.
00:01:23
Speaker
So, let's get learning. In fact, let's go meta and do some machine learning. I'm your host, Chris Jenkins. This is Developer Voices, and today's voice is Adi Polak.
ML with Apache Spark Overview
00:01:49
Speaker
Joining me today is Adi Polak. Adi, how are you doing?
00:01:53
Speaker
I'm doing great. I'm super excited to be here with you today, Chris. It's great to have you back. Although the last time we talked in a podcast, we were physically in the same room. And now we've got to do it over the web. Yeah, different times, I guess. Yeah, occasionally we get
Understanding Machine Learning Phases
00:02:09
Speaker
to travel now. But so you have, since we last met, you've just released an O'Reilly book called, let me get this right, Scaling Machine Learning with Apache Spark, right? Right.
00:02:23
Speaker
And the more I've thought about that title this morning, it's almost like a haiku. You could almost unpack every single word, scaling, machine learning, Apache Spark, three big topics, and you've got to put them all together in one book. Let me, so where are we gonna start on this? Let me tell you my understanding of machine learning, and then you can start to talk to me about how it scales or where I'm wrong, right? Sound good?
00:02:55
Speaker
Okay, the way I think about machine learning is it's a two-phase process. You've got a big bunch of data you want to learn from, you do some fancy maths on it, and you get a model out. And that model's essentially a function. You take that function somewhere else and you say, hey function, here's a picture. And the model says, there's a 70% chance that's a picture of a cat. Is that roughly how it works out?
Aligning Business Goals with ML Systems
00:03:23
Speaker
Roughly. Yeah, well, it's like a specific scenario that we're discussing of image classification. There are multiple scenarios also, but yes, this is how things work. The process, the pipeline, tends to be naturally a two-phase thing, which you've got to think about how to manage and how to scale, as well as the individual components in that pipeline.
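To make the two-phase picture concrete, here is a minimal sketch in plain Python rather than Spark, so it runs anywhere. The nearest-centroid "algorithm", the two made-up features, and the names `train` and `classify` are assumptions chosen for illustration, not something from the book. It returns a label rather than a probability such as "70% cat".

```python
from collections import defaultdict


def train(examples):
    """Phase one: turn labelled data into a model (one centroid per label)."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for (x, y), label in examples:
        sums[label][0] += x
        sums[label][1] += y
        counts[label] += 1
    centroids = {
        label: (s[0] / counts[label], s[1] / counts[label])
        for label, s in sums.items()
    }

    def model(point):
        """Phase two: the trained model is just a function from input to label."""
        px, py = point
        return min(
            centroids,
            key=lambda label: (centroids[label][0] - px) ** 2
            + (centroids[label][1] - py) ** 2,
        )

    return model


# Made-up training data: (ear_pointiness, snout_length) -> label
training_data = [
    ((0.9, 0.2), "cat"),
    ((0.8, 0.3), "cat"),
    ((0.3, 0.9), "dog"),
    ((0.2, 0.8), "dog"),
]

classify = train(training_data)   # development phase
print(classify((0.85, 0.25)))     # production phase: call the function on new input
```

The function returned by `train` is the thing that gets shipped; everything else in the conversation is about producing it repeatedly and running it reliably.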
00:03:51
Speaker
We can split it into two sections. Like you mentioned, one phase is the development itself and the second one is after deployment when it's in production. Of course, each one of these would have their own kind of sub-tasks or sub-phases that people need to build in order to get through the different stages.
00:04:14
Speaker
Yeah, but roughly, one is the development, research, and so on. And the second one, now that it's in production, what do I do? Or how do I bring into production, and now what's next? And you cover both, you try to cover all of that pipeline fairly thoroughly in the book, right?
00:04:34
Speaker
Yeah, it starts from the very beginning of, you know, what can machine learning do for your business with some examples from big companies? What other companies are doing to drive more business? And then aligning the business goals with, you know, the actual planning of how do we build a machine learning system, machine learning pipeline workflow at the end of the day, and then goes into all the nitty-gritties of let's get practical. This is what we do. This is how we can leverage
00:05:03
Speaker
existing tools that we have in our organization. What are the tools that we might need to bring in? How do we run the decision-making process around pros and cons, breaking it down and not just taking
00:05:18
Speaker
kind of like a blueprint of something else. It's like actually thinking what would be beneficial for the organization that the individual is part of, which I think is really critical for people to do as an exercise and also for people who are interviewing and new to the space. So people that are just joining, it's good to have that kind of like critical mindset and critical thinking.
Why Choose Apache Spark for ML Pipelines?
00:05:44
Speaker
And then it dives into the actual technology. How do you do?
00:05:47
Speaker
How do you deploy different deployment patterns? How do you monitor your model in production? And when do you archive it? Like when it's done and you need to start training again. Yeah, because there's also the whole iteration part of this process, right? Exactly. Which in itself is a lot of work.
00:06:13
Speaker
I can see why this becomes actually quite a large infrastructure thing by the time you want to actually go to production. Why did you make Apache Spark the backbone of that?
00:06:25
Speaker
Yeah. Apache Spark is one of the most adopted and used technologies in the world in the data and analytics space.
Distributed Computing and Data Analytics with Spark
00:06:36
Speaker
We used to call it big data. Today, we'll say advanced analytics or analytics at scale or different wording. But at the end of the day, it means that we need multiple machines in order to compute the results that we want to see at the end of the day because one machine is just not enough or it will take forever to compute.
00:06:55
Speaker
And so Apache Spark gives us that generic engine to run distributed computing on top of large data. And a lot of people have been using it for that, for analytics, for data pipelines, for scaling data pipelines. But actually when it was just started back at the university lab, AMPLab,
00:07:22
Speaker
it was started, it was initiated to help machine learning researchers scale their efforts, because a lot of the tooling they had (yeah, I know, not a lot of people know that) was not scalable, and Hadoop MapReduce and Mahout were very hard to work with for data scientists because you had to understand all
00:07:48
Speaker
the partitioning, how to initiate the mapping, what's going on in the reduce. A lot of distributed systems concepts in order to actually get things done. It was built for that.
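As a rough illustration of the overhead being described, this sketch spells out the partition, map, and reduce steps by hand for a word count. It is plain Python on one machine, not Hadoop or Spark code; the function names and the sample lines are made up for the example.

```python
from collections import Counter
from functools import reduce


def partition(records, num_partitions):
    """Split the input into chunks, as if each chunk lived on a different machine."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def map_phase(chunk):
    """Map step: each 'machine' counts the words in its own chunk."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_phase(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


lines = ["spark scales ml", "ml needs data", "spark reads data"]
partials = [map_phase(chunk) for chunk in partition(lines, 2)]
word_counts = reduce(reduce_phase, partials, Counter())
print(dict(word_counts))
```

With a higher-level API, the user would only state the question (count words per key) and leave the partitioning and merging to the engine.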
Integrating Spark with Data Storage Solutions
00:08:01
Speaker
It was built for folks saying, hey, you don't need all this overhead of
00:08:05
Speaker
how the distributed computing actually works, and we'll give you the API that abstracts away MapReduce operations. It doesn't use Hadoop MapReduce; it uses its own, completely different software, but it's got the abstraction on top of that, and that's the brilliant part. So not a lot of people are aware of that. And also, because
00:08:32
Speaker
Bless you. Thank you. It's already part of the organizations. And data scientists today are struggling with getting access to systems, to data, to having their own tools. So they are usually, unfortunately, being deprioritized in terms of workloads and support from other teams in the company.
00:08:58
Speaker
So I believe the best strategy is to ask: what exists in the company? How can I plug into resources that already exist? And then how can I build on top of it? And Spark is already part of so many data and analytics infrastructures. So it would be a smart move for data scientists to plug into what exists, learn about it, use it. And then if they need other tools to support their workflows,
00:09:27
Speaker
they can, but at least they have kind of like the main engine to do their work available for them. So using Spark as the backbone to the architecture. Exactly. We're leveraging what exists in the organization instead of trying to
00:09:44
Speaker
bring new things in that we know is very, very hard, especially if those need access to production data, for example, or staging data, or they need to plug into the rest of the architecture that the engineering team is managing.
00:10:07
Speaker
Okay, so I don't know how Spark actually works under the hood. You're saying it can connect to Hadoop, it must connect to Hadoop, and what kind of API does it present, and what is Spark like to actually use? Yeah, so Spark has APIs that are available in Python, in Scala, in Java.
00:10:31
Speaker
Yeah, so there's also SQL for folks that prefer to write SQL. And it gives us the ability to run distributed compute. And there are top-level APIs on top of DataFrames and Datasets, again, abstractions of tables, essentially, tabular data, that enable us to operate on top of that tabular data without the need to understand
00:11:01
Speaker
how to manage that compute at scale.
00:11:04
Speaker
which is really, really critical. So we don't need to think about, oh, you know, this chunk of data is going to be processed in that machine, or I need to start that machine with so-and-so parameters and so on. There's no need to do that. So that really simplifies people's lives when we think about, you know, I just need to know the API. For tuning and performance, I need to know the internals also, but again, that's more advanced. First of all, let me kind of put some things together and start working with it.
00:11:34
Speaker
So that's, this is one thing and what was the second question? What's the underlying storage? You said it was Hadoop, is it still? Can you connect to different data sources?
00:11:48
Speaker
Yes, you don't have to use Hadoop. It has a generic connector that anyone can build on top. This is what I believe drove a huge adoption for Spark, the generic approach, saying, hey, you can connect any storage that you want, whether it can be a local file system, if you wish.
00:12:11
Speaker
Data Lakes, S3, Azure Blob, GCP file system, MongoDB, Cassandra, everything that's available in the HDFS and the NoSQL world. MySQL, things that are more in the DBMS side of things.
00:12:33
Speaker
Hive, a lot of systems are still using Apache Hive. I can see myself wanting to connect like analytics from a web server log into some kind of fancy ML model out to maybe like a relational database to query the model from. And that would be straightforward.
00:12:55
Speaker
Yeah, Elasticsearch, OpenSearch, if you think about analytics, log engines, kind of a document file format. Yeah, you can connect with Spark and kind of leverage that. I didn't see a lot of people putting them together, but it is possible. Kafka has one of the best integrations. I think it's like a top-level integration with
Common ML Pipeline Use Cases
00:13:21
Speaker
the open source. Yeah.
00:13:26
Speaker
Okay, well, in that case, tell me a pipeline you have seen commonly put together. Tell me about a use case and which pieces I would use to make it happen. Yeah, so a lot of IoT systems, like smart cars or devices, are leveraging Spark, usually together with Kafka, which is really interesting. It's like they're bringing in new messages or information from sensors,
00:13:54
Speaker
And then they want to make decisions about, for example, what's going on in their organizations or in their factory or in their cars and so on. And so they will bring in events. This is what we call events, being the sensors, images, videos that are breaking down to
00:14:15
Speaker
images at the end of the day. So we're pulling it in. We're doing some ingestion process. And then we're starting to do the transformations on top of them. Because it comes in a specific format. It might be JSON based. It might be
00:14:32
Speaker
JPGs and so on. And Spark has a connector for binary. For data that is binary, we can read it. It knows how to take images in as well. It also knows how to work with JSON formats or semi-structured data, which is also great because this is something that we can leverage. And of course, we need to clean that data and we need to give it a solid format at the end of the day. Every single data science project begins with data cleaning.
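A small sketch of what "clean that data and give it a solid format" can look like for JSON sensor events. It uses only the Python standard library, and the field names (`sensor_id`, `temp_c`, `ts`) and sample payloads are invented for illustration; a Spark job would apply the same kind of rules across a cluster.

```python
import json

# Made-up raw events: one has a string where a number is expected,
# one has a missing reading, and one is not JSON at all.
RAW_EVENTS = [
    '{"sensor_id": "s1", "temp_c": "21.5", "ts": 1700000000}',
    '{"sensor_id": "s2", "temp_c": null, "ts": 1700000005}',
    'not json at all',
    '{"sensor_id": "s1", "temp_c": 22.0, "ts": 1700000010}',
]


def clean(raw_events):
    """Parse JSON events, drop unusable rows, and coerce fields to fixed types."""
    rows = []
    for raw in raw_events:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed payload
        if not isinstance(event, dict) or event.get("temp_c") is None:
            continue  # no reading to learn from
        try:
            rows.append(
                {
                    "sensor_id": str(event["sensor_id"]),
                    "temp_c": float(event["temp_c"]),
                    "ts": int(event["ts"]),
                }
            )
        except (KeyError, TypeError, ValueError):
            continue  # missing field or value that cannot be coerced
    return rows


cleaned = clean(RAW_EVENTS)
print(cleaned)
```

Dropping bad rows is only one possible policy; imputing missing readings or routing rejects to a separate table are equally reasonable choices depending on the use case.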
00:15:01
Speaker
Yeah, it's like, what do you do with your time? Well, 70% of my time I'm cleaning my data. 10% I'm trying to explain what it is that I'm doing to other folks in the company. And 20% is that glorious, exciting ML work. Always, yeah. Yeah, so that data is accumulated. And usually what a lot of companies will want to do is to
00:15:29
Speaker
translate the business objectives into something that they can pull out of that data, either automation or more information for the user or know if there is an accident with the car or something happened or anomaly detection around everything that the sensors are sensing and make sure that the car is in a good state.
00:15:55
Speaker
Autonomous driving is really, really fascinating, because what they're doing is trying to assess what is on the road. They have a bunch of machine learning work that they do there; all the big companies now are rallying and
00:16:09
Speaker
hiring some of the best engineers in data science to do that. Some of their workloads are actually trying to assess what is on the road. Do they need to stop? Do they need to continue driving? What is the speed limit? What other cars are on the road next to them? Also, this is why they have all the sensors and the cameras around. Yeah, so it's a very extensive market
Practical ML Applications and Feature Engineering
00:16:39
Speaker
And there's always the less headline-grabbing one, but it's the one we see absolutely all the time, which is recommendation systems, right? Yeah. From which Netflix video you should watch next to is this the synthesizer you're trying to buy next, that kind of thing.
00:16:57
Speaker
Yeah. And, you know, I kind of wish companies would do a better job with recommending things that I didn't buy rather than recommend me the same thing I purchased. There's one company that keeps emailing me about other engagement rings I might want to buy. It's like, that's a one time purchase, my friend. Yeah, exactly.
00:17:24
Speaker
Okay, so we've got the pipeline like that, take your use case. Let's get into what I would love to deem fancy maths, because I know your book covers a number of different libraries for doing different kinds of fancy maths on the input data.
00:17:42
Speaker
Yeah, so you take that data in and first of all, you want to understand what is the business problem that you want to solve and what is the domain of the data. So if I know that this data is sensors, what are these sensors? What are they sensing? What would be kind of a normal rate?
00:18:00
Speaker
And then it means either for the individual to be a domain expert and learn the space or bringing in some domain expert to consult with. So first of all, understanding what it is and then thinking through which features would make sense to extract out of that data. Because what we want to do is we always want to combine multiple data sets in order to create better features that explain the situation in the world in a better way.
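As a hedged illustration of combining data sets into features, the sketch below joins hypothetical speed readings with a hypothetical speed-limit lookup and derives per-car columns. All names and numbers are made up; the point is only that the derived columns say more about the situation than either raw input alone.

```python
from statistics import mean

# Hypothetical raw inputs from two different sources.
readings = [
    {"car_id": "a", "speed_kmh": 50.0},
    {"car_id": "a", "speed_kmh": 70.0},
    {"car_id": "b", "speed_kmh": 120.0},
]
speed_limits = {"a": 60.0, "b": 100.0}  # car_id -> limit on the road the car is on


def build_features(readings, speed_limits):
    """Join the two data sets and derive one feature row per car."""
    by_car = {}
    for reading in readings:
        by_car.setdefault(reading["car_id"], []).append(reading["speed_kmh"])

    features = {}
    for car_id, speeds in by_car.items():
        limit = speed_limits[car_id]
        features[car_id] = {
            "avg_speed_kmh": mean(speeds),
            "max_over_limit_kmh": max(speeds) - limit,
            "share_over_limit": sum(s > limit for s in speeds) / len(speeds),
        }
    return features


features = build_features(readings, speed_limits)
print(features)
```

Which derived columns are worth keeping is exactly the domain-expert question raised in the conversation.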
00:18:29
Speaker
This is essentially what we're going to inject into the algorithm to extract the machine learning model. Just define features for me quickly for those that don't know.
00:18:40
Speaker
When we think about a table, it can be columns, for example. We call them features because they're the features of the world that we're mapping. In machine learning, we name it feature engineering because it's different from features in software. These features are modeling the world that we want to automate or the problem space. They give us better information than just
00:19:09
Speaker
what we just got from, you know, as raw data. It's like picking out the columns that we think are going to be important for the model. Exactly. Yeah. Columns that will be representative of the world and making decisions. Okay, so we've got a bunch of features we're trying to push into our model. Yeah, and then we're combining it with extensive statistics works to understand if
00:19:39
Speaker
Those are statistically significant. The more data we have, the better our chances of reaching significance in statistics. This is how math works.
00:19:59
Speaker
But there's also a scaling issue in that, right? So you say more features equals better model. But then if you pick out all the features from all of your data, you've got a scaling problem because it takes forever to calculate. Oh, 100%. There's a trade off we need to worry about. There is always a trade off. And
00:20:19
Speaker
There's also a question like, how do I get access to the actual data, right? For data scientists, it's like, oh, I got to this organization. How do I get access to the data? This is why plugging into the existing systems is critical. And if the existing system is already leveraging distributed computing, then how about leveraging that and saying, I can do all the feature engineering, cleaning, and pre-processing with the distributed computing system that is already in-house?
00:20:49
Speaker
Right. So we're in this world where hope... Yeah, I can imagine a lot of people join a company because the company is excited about doing some kind of ML with its data. And your very first problem is just getting and munging the data in some way. And if they've got an existing system that's spread over multiple machines, that's going to make scaling the pipeline that much easier, right? That's what you're saying.
00:21:16
Speaker
Exactly, because you're plugging into what exists and it's like, okay, I just need access to what already exists in the ecosystem and I don't need to download the data or save it in an unsafe space or other things that data scientists are doing.
00:21:38
Speaker
Especially around security, a lot of companies don't want their employees to download data to their own laptops to process because of IP. It could be sensitive customer data, so there's a lot of gates to even reach that space. So if you can process it in the already safe space,
00:21:57
Speaker
the company kind of already put together, then it would make, you know, data science life much easier because, hey, they are plugging into what exists rather than trying to kind of break through security. Would that involve deploying Spark to the existing cluster in some way, like onto the existing machines or what's the architecture there?
00:22:27
Speaker
No, so that means I would, as a data scientist, I would get an environment that already has Spark in it. And I might have access to some notebooks or I need to connect through my laptop for notebooks to run things on the Spark cluster. But this is all, this is fine. I mean, this is how people are working. It's like we're connecting to our remote cluster in the cloud so we can run our ad hoc
00:22:53
Speaker
queries and ad hoc work until we figured out what's the right model to build. Then, of course, we want to automate this process. We want to build a machine learning pipeline where we can do it repeatedly as we're doing the experiments until
00:23:13
Speaker
we find a good model, essentially. And then we're moving. Because that's another part of scale, isn't it? Because you don't do this thing where you train one model and go, it's great, ship it to production, you actually have to iterate over and over. And that's a scaling of time. Exactly. And also, in machine learning, some of the things that you're injecting into the algorithms are hyperparameters and parameters.
Automated ML Pipeline Creation with Spark ML
00:23:37
Speaker
So those are going to change the output as well and the accuracy of the model. So what we're doing, we're building a matrix of all the different combinations of those. So people when they are, so data scientists when they are
00:23:55
Speaker
plugging in the data, they can also plug in the matrix, the param matrix, which essentially builds models that try out all the different permutations of those parameters and gives us the best model and also the results of all the other models. And this is something that is built in with Spark ML also.
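The idea of a matrix of parameter combinations can be sketched in a few lines of plain Python. Spark ML ships its own utilities for this (`ParamGridBuilder` and `CrossValidator` in `pyspark.ml.tuning`); the version here is a stand-in that runs without a cluster, and `train_and_score` is a made-up scoring formula standing in for real training and validation.

```python
from itertools import product


def param_grid(options):
    """Expand {name: [values]} into one dict per combination of values."""
    names = sorted(options)
    return [
        dict(zip(names, values))
        for values in product(*(options[name] for name in names))
    ]


def train_and_score(params):
    """Hypothetical stand-in for training a model and measuring its accuracy."""
    return (
        1.0
        - abs(params["learning_rate"] - 0.1)
        - 0.01 * abs(params["max_depth"] - 5)
    )


grid = param_grid({"learning_rate": [0.01, 0.1, 0.5], "max_depth": [3, 5]})

# Keep every result, not just the winner, so the runs can be compared later.
results = [(train_and_score(params), params) for params in grid]
best_score, best_params = max(results, key=lambda result: result[0])

print(len(grid), "combinations tried; best:", best_params)
```

The number of combinations grows multiplicatively with each added parameter, which is one reason this step benefits from a distributed engine.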
00:24:23
Speaker
Spark ML has community-developed pipelines, machine learning pipelines, that are very easy to use and very intuitive. And you're also able to give it all the different processing of the data, including everything that can come to mind, tokenization, hashing, everything that you need,
00:24:42
Speaker
also the algorithm and also the param matrix. And then it knows how to run this whole pipeline together. So you get a fully automated experience. And then at the end of the day, it gives you the results, like what was the best model. Now you can take it and, you know, move it forward in the stack. Okay, so that actually unpacks a couple of things. Spark ML is an ML package that ships with Spark. Yes.
00:25:09
Speaker
So why are we also talking about things like TensorFlow and what is it PyTorch and MLlib? When do I need those different pieces?
Choosing Between TensorFlow, PyTorch, and Spark ML
00:25:19
Speaker
Yeah, so MLlib and Spark ML are essentially the same thing. MLlib was a library that
00:25:26
Speaker
is a relatively older library, kind of a legacy library that is still extensively in use and uses a different Spark API. It's named RDD, resilient distributed datasets, which are not optimized by the Spark query engine, versus Spark ML, which is a relatively newer library, and it leverages the DataFrame API. And so the DataFrame API is going through
00:25:56
Speaker
Catalyst, which is the Spark
00:26:01
Speaker
optimization and execution engine that helps us optimize all the operations that you're running at scale. So those are essentially the same, they're just leveraging different, I guess, software pieces within the software architecture. It's always better to use the Spark ML library. If you can find what you need there, this is great. If not, you can use MLlib, just be conscious that those won't get optimized
00:26:31
Speaker
by Catalyst because of the hierarchy in the Spark software itself. It's a bit like a query planner in a relational database. Exactly. It's a job planner. What about TensorFlow? Because that's got Google hotness all over it. 100%.
00:26:54
Speaker
So I did extensive research before, and what I've seen and learned from customers and users and also from my own experience is that Spark has state-of-the-art algorithms that were developed in machine learning. However, it doesn't always cover all the stack. So there is relatively new research that came up. There are more advances in the deep learning space, in the
00:27:17
Speaker
neural network space. And sometimes Spark scheduler itself, the MapReduce, can become a bottleneck for running compute or advanced compute of a neural network.
00:27:33
Speaker
because what's happening in the neural network, there is back propagation and forward propagation. And that means I'm doing a bunch of computations. And then I'm doing forward propagation, I'm taking all the results of these computations and moving on to the next phase of the computation within
00:27:48
Speaker
my network. But then I want to do back propagation. So I want to go back and actually recompute some of the things because now I realized that I made a mistake. So I want to fix what I did before. It's kind of like going back and forth inside this graph of compute, which at the end of the day MapReduce can support, but it doesn't do it in the optimal way. And it has a bunch of
00:28:16
Speaker
bottlenecks and things that the community is still trying to solve. The community did introduce a new scheduler to do that. But again, in Spark it's very, very early stage. And PyTorch and TensorFlow are more advanced in that. When they started, they focused their effort into neural network processing,
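The back-and-forth described here can be seen in the smallest possible case: one "neuron" with a weight and a bias, trained by repeated forward and backward passes. This is a plain-Python sketch under simple assumptions (squared error, a fixed learning rate, toy data generated from y = 2x + 1); it says nothing about how Spark, TensorFlow, or PyTorch schedule the work, only why the computation has to loop.

```python
def train_neuron(samples, learning_rate=0.1, epochs=200):
    """Fit y ~ w * x + b with repeated forward and backward passes."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in samples:
            # Forward pass: compute the prediction from the current weights.
            prediction = w * x + b
            error = prediction - target
            # Backward pass: gradients of the squared error w.r.t. w and b.
            grad_w = 2 * error * x
            grad_b = 2 * error
            # Go back and correct the weights that produced the mistake.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # toy data from y = 2x + 1
w, b = train_neuron(samples)
print(round(w, 3), round(b, 3))
```

Every update depends on the result of the previous one, so the work is a long chain of small dependent steps rather than one large pass over the data.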
00:28:45
Speaker
kind of the deep learning space, images, and working with text. And so because they were investing in it for so many years, they have better tooling and better system. And then the question was, you know, I want to really leverage what exists in the organization, so I want to go with Spark. And then how can I bridge into other technologies that can enrich my
00:29:10
Speaker
toolset at the end of the day so I can fully build the models and get the quality that I'm looking for. This book tells the story. It's like, here are all the wonderful things that you can do. Then if you need more tools in your arsenal as you continue developing and go forward, then here is how you can bridge into these systems as well,
00:29:34
Speaker
running those at scale. So I'm diving deep into how they scale, what it is that you need to do, how it looks behind the scenes, why their architecture is different from each other and also different from the Spark architecture and so on. Yeah. OK.
ML Model Deployment Patterns
00:29:54
Speaker
So you're really going all in on this thesis that Spark is that universal architecture to get in and grow from. Listen, it's everywhere.
00:30:05
Speaker
Find me a company that needs to run distributed computing and advanced analytics and doesn't have a notion of Spark. I speak with companies where most of their work is logs and analytics and some stuff. And then, oh, you know what? We also have some Spark because our BI folks
00:30:23
Speaker
need to have their analytics and SQL. You know, Spark SQL works really well on top of all this data, so they can make sense of the BI in the organization. So even if it's not something that people are leveraging to serve their customers, there's still internal analytics or other things that companies are doing. So.
00:30:44
Speaker
Yeah, yeah. Getting hold of the data and connecting it and processing it and sending it somewhere else is just utterly universal, right? This is slightly an aside, but Spark sounds like it's in a very similar space to Flink.
00:31:01
Speaker
It is in a way, although Flink was built for streaming, right? There's still notions of MapReduce and in-memory and so on, but it's targeting the streaming space versus Spark also has structured streaming, but it's not the main focus. It has batch processing, structured streaming, SQL ad hoc if people want to, or SQL if they want to. There's machine learning, there is graph, there are graph algorithms.
00:31:32
Speaker
So it's not the same, but it is, I mean, I can understand how people can get confused sometimes about these things, but it's a different technology. Although they do leverage, you know, distributed computing, they do leverage some of the basics of that space, which is always interesting, especially for people who are moving from one tool to the other.
00:32:01
Speaker
it kind of makes sense. Oh, you know, this is how it works in Spark. I can better understand now the Flink architecture because I've seen something similar. Right. Okay, yeah. Okay, in that case, let's move onwards out into production. What do you do once Spark has given you a model that you can do lovely predictions with? Yeah, so there's a lot of deployment patterns. And those really depend on what it is that we need, right? So it can be
00:32:30
Speaker
within a microservice, right? It can be on its own service. It can be part of a batch processing that we're running. It could be part of a stream processing that we're running in production if we have kind of a stream data pipeline. So all of those depend on taking these functions. Where do I need to plug it in?
00:32:50
Speaker
Spark enables us multiple capabilities. If we have already a Spark pipeline and a Spark batch processing, it's easy to plug in, load it, and leverage that as part of the pipeline that I already have. Same thing with streaming. It's easy to plug it in and leverage that.
00:33:07
Speaker
Also, if we're working with tools like MLflow, and I'm covering MLflow also in the book as part of the ecosystem that supports the work that we do, then MLflow enables me to wrap a method to wrap a function and take it and deploy it as a function.
00:33:25
Speaker
And so I can leverage that in microservices environments as well. So as its own, if I want to, you know, put the REST API and kind of have it as its own service, and then I can query my machine learning service at the end of the day, or if I want to have it in service, so attached to an existing
00:33:47
Speaker
server service that's already running. There are some things to think about like around scale and deployment. Like if I am wrapping it within an existing service, that means that my deployment cycles are also going to be attached to the deployment cycles of the actual service. Like if the models and the service are not changing at the same time, it's something to bear in mind. Also, it might be better to
00:34:15
Speaker
kind of unpair those. Also another thing to think about between those two is the hardware. If I need GPU or if I need some specific hardware to run the model in production to get the result fast, then I might want to also not put them together. I don't want to kind of have that pairing on that.
00:34:40
Speaker
There are pros and cons for each one of them, and it really depends on what I need, what are the release cycles that I have within the organizations, how those are being tested, what's the hardware that I need, because machine learning could be different than the rest of my software. Yeah, if you're generating images from a deep learning model, that's going to need some fancy GPUs, right?
00:35:06
Speaker
Yeah, yeah. And then I want to be as efficient as I can. And I don't want to run any workloads on the same server because I want to be efficient and leverage that for the purpose that I, you know. Yeah, yeah, you don't want every machine in your AWS cluster running the most expensive graphics card, just because some of them need it.
00:35:29
Speaker
Oh no, and then looking at the utilization and realizing, oh, I'm only using 5% of my GPU, but I paid for it. I'm certain there are companies out there doing that right now. I hope they buy a copy of your book. Yeah, this is it because we think like deploying ML models is just, I've baked my function and that was the hard part, but getting into production, there's actually quite a chunk of the book you cover dealing with that, right?
Monitoring ML Models in Production
00:35:59
Speaker
getting into production and also monitoring production. When do you know when you need to run another cycle, right? When do you know? That's a really good question. Thank you. I thought of it myself.
00:36:16
Speaker
Um, yeah, so it depends on the model. It always depends, you know, everyone, uh, it's true. It's, uh, it depends on the use case and the model, but essentially what we're looking into, and we can get extremely fancy, it's like, oh, let's do windowing, how the data is changing, how the model, uh, the data that we're injecting into the model is changing, and so on.
00:36:36
Speaker
One of the things I've seen work best with companies that I worked with and helped is if you can compare the expected result, the one the model gives us, with the actual result and then assess accuracy.
00:36:58
Speaker
That would be probably the most cost efficient way to assess the quality of the current model that runs in production. There are more fancy methods like
00:37:13
Speaker
data drift, model drift, business drift, but those always require heavier processing of the data. You need to develop specific data pipelines in order to say, oh, there was a complete drift in this data. I see a bunch of anomalies and now it's like the average is not five, now the average is 10. Should I do something about that? Should that actually impact the model and the results?
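A minimal sketch of the kind of drift check described here, where an average moves from five to 10. It assumes the simplest possible approach: compare the mean of a recent window against a baseline window and flag a shift of more than a few standard errors. The function name `mean_shift_detected`, the threshold of 3, and the sample values are illustrative assumptions, not something taken from the conversation or the book.

```python
from statistics import mean, stdev


def mean_shift_detected(baseline, recent, threshold=3.0):
    """Return True if the recent mean is far from the baseline mean.

    "Far" means more than `threshold` standard errors, where the standard
    error is estimated from the baseline's sample standard deviation.
    """
    base_mean = mean(baseline)
    base_std = stdev(baseline)
    if base_std == 0:
        # A constant baseline: any change in the mean counts as a shift.
        return mean(recent) != base_mean
    standard_error = base_std / len(recent) ** 0.5
    z_score = abs(mean(recent) - base_mean) / standard_error
    return z_score > threshold


# Hypothetical feature values: the average used to be about 5, now about 10.
baseline_window = [4, 5, 6, 5, 4, 6, 5, 5]
recent_window = [9, 10, 11, 10, 10, 9, 11, 10]

if mean_shift_detected(baseline_window, recent_window):
    print("Input data has drifted; consider reviewing or retraining the model.")
```

A real drift pipeline would likely look at full distributions and many features at once, which is where the heavier processing mentioned above comes from.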
00:37:38
Speaker
Which is great, and some companies would go that path if they can't find a better, more optimized way. But if they can, it's usually better to compare the expected versus actual, and then get a live feed of the model's accuracy in production, and build that as a feedback loop. Give me a concrete example of that, because I'm trying to think... You're not saying... I don't think you're saying...
00:38:07
Speaker
Dave sits there checking that the pictures actually do contain cats. Maybe that's what you are saying, I don't know. So let me give you, let's say I'm taking a ride, okay? I'm ordering a taxi or a cab and they give me kind of an estimation of how long this ride should take. Right.
00:38:29
Speaker
You know, you'll be home by 8 PM, for example. Yeah. This is the prediction, right? If you take this cab now, you'll be home by 8 PM, versus if you take the bus, you'll be home by 8:30. And then I can decide, okay, I want to order the cab, and then the system can track when I actually got home, or when I actually got to my destination.
00:38:56
Speaker
So this is a great example where I have the expected and also the actual. And so I can compare the two and know if my model actually delivers on the result or it was completely far off from the actuals. Right, yeah. I suppose recommendation engines would be similar, right? I can show you five things you might be interested in and see if anyone actually clicks on those.
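The expected-versus-actual comparison from the taxi example can be sketched in a few lines. This is only an illustration under stated assumptions: each ride is a `(predicted_minutes, actual_minutes)` pair, accuracy is measured as mean absolute error, and the five-minute tolerance in `needs_retraining` is an arbitrary, hypothetical threshold.

```python
def mean_absolute_error(rides):
    """Average gap, in minutes, between predicted and actual ride durations.

    `rides` is a list of (predicted_minutes, actual_minutes) pairs.
    """
    return sum(abs(predicted - actual) for predicted, actual in rides) / len(rides)


def needs_retraining(rides, max_error_minutes=5.0):
    """Flag the model when its average error exceeds the tolerance."""
    return mean_absolute_error(rides) > max_error_minutes


# Hypothetical completed rides: (predicted minutes, actual minutes).
completed_rides = [(20, 22), (30, 27), (15, 16)]

print(f"Average error: {mean_absolute_error(completed_rides):.1f} minutes")
print(f"Retrain? {needs_retraining(completed_rides)}")
```

Fed continuously with completed rides, a check like this gives the live accuracy signal for the feedback loop, without anyone reviewing predictions by hand.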
00:39:23
Speaker
Yeah, for example. OK, so we can automate it without Dave looking at cats all day. I mean, you know, if people fancy looking at cats all day, why not? Well, you know, once we've automated it, that frees us up to look at cats. That's the other way of looking at it. I hadn't considered that monitoring would be such a big part of this, but of course it will be.
00:39:49
Speaker
It's just that every large-scale data system these days is intimately worried about monitoring.
00:39:57
Speaker
Yeah, every software in the world should think about monitoring and observability and what is actually happening in the
Explaining ML Model Decisions
00:40:04
Speaker
system. Because when things go south, you want to know before the customer is impacted, right? It's never fun getting that call from the customer saying, oh, you know, you just killed, I don't know, half of my infrastructure. Or, you know, if we're talking about kind of a B2C, people complaining or abandoning the application, it's not a place where we want to go
00:40:27
Speaker
when we build companies and when we build software.
00:40:33
Speaker
Monitoring is critical, and machine learning models in production are a similar thing. They serve customers and they need to be monitored, and monitored even more extensively because of the unknown: it's hard to explain why the model is making decisions. It's kind of like a box that we cannot see through. It's the black boxness of ML that means it's all the more important you see what's coming out as well as going in.
00:40:58
Speaker
Exactly. And explainability is really hard. I can get fresh data, get it into the system, get results, but not really know why the model got to those specific results. Like, why is the output the way it is? And so observability and monitoring become a more critical part of that. So I know when to switch, right? I know when to retire the model and kind of rerun
00:41:27
Speaker
the automation for training another model and deploy that new one to production. This makes me wonder, because as you know, I'm interested in the world of real-time computing, to what degree are we getting to the point where models will be automatically relearning and redeploying? Is it always going to be this two-phase batch thing, or are we going to be constantly improving the models automatically? That's a good question.
00:41:55
Speaker
I believe we'll always need to have humans observing or overseeing the process and ticking boxes, making sure things work well. Some visionaries in the industry would say, oh, everything will be automated. But things always go wrong with data being ingested and cleaned.
00:42:22
Speaker
The data itself, the distribution of the data itself is changing, so we need people to tweak the algorithms as well. So I don't see us getting into a place where everything is 100% automated. I do see us getting into a place where more people are
00:42:44
Speaker
able to create that pipeline because of the tools that exist in the ecosystem and the new tools that emerge. So it's definitely, machine learning is a growing space. And we're seeing it, especially now, now it's like booming.
The Human Role in the Future of ML
00:43:03
Speaker
OpenAI has shown that machine learning at scale is critical also in the deployment part and also in the training part.
00:43:12
Speaker
We're going to be doing more and more of it more and more professionally at more production scale. More production scale, better customer service, automating a lot of things that used to be manual work that people did, and enabling people to do more, especially.
00:43:31
Speaker
Well then, in that case, let's bring it back down to the small to finish. If this is something I thought was in my future, should I get started with Spark? You know, if I want to just play around with ML on my laptop, is Spark a good place to start? Or is it something I only take once I'm looking at getting into production?
00:43:52
Speaker
It could be. There are, of course, simpler ways to start with machine learning, but Spark has these really good APIs. And today, if you have Docker installed, there are images that give you PySpark, so Python Spark plus notebooks, so you can get started right away. And I believe it's a better tool to learn with because you actually gain the experience with tools
00:44:21
Speaker
that are available and used in production. So it goes beyond just, you know, producing a model for the sake of producing a model. Throughout producing a model, you're actually learning a tool that will be useful for you for, you know, for a career. So do you think it's worth that slight overhead to go, instead of learning to build toys, to building things that could potentially be production worthy?
00:44:49
Speaker
Yeah, and people can still build toys. I have a Docker image with PySpark and a notebook. I can launch it on my laptop, and I'm building something. It's not scalable. I can't ingest tons of data into it because I'm running it on one machine. I didn't connect it to any cluster that actually runs distributed computing. But the nice thing about it is the same code.
00:45:15
Speaker
If I connected it to a distributed cluster, it would run. That's the nice part of it: you can take it out into production without massive code changes. That's always a great thing to have as a starting point, right?
Book Availability and Upcoming Conference
00:45:30
Speaker
Most people don't have that starting point. Most people, the second they want to go into production, they're blocked on re-engineering. This is one of the challenges also with data scientists. They create these great models, but they're using tools that
00:45:45
Speaker
You know, the rest of the team doesn't know how to take to production, so... So, pick up Spark and a copy of Adi's new book. Yeah, the book was sold out on Amazon, but I think it now should be back in stock. Yes, I know. I got a couple of messages from people saying, I really can't find the book because it's sold out. That's a nice message to get, as problems go as an author.
00:46:11
Speaker
Yeah, yeah, and you know, I found myself reaching out to the O'Reilly team. It's like, hey, what can we do? How can we help them? How can they get a message when it's back in stock and so on? So yeah, now it's back. So people can go, if they want a hard copy, it's available now in hard copy as well.
00:46:33
Speaker
Cool. Well, I think there's a good chance I might see you in person next week at a conference we might both be at. So hopefully I can get a signed copy. Yes, 100%, it will be my pleasure. Adi, thank you very much for talking to us. Chris, thank you so much. It was a lot of fun. And I hope you enjoyed it as well. Cheers. See you soon. See you soon.
00:46:58
Speaker
Thank you, Adi. If you'd like to learn more, I'll put a link to her book in the show notes, and a link to Apache Spark if you can't wait that long to dive in. Before you head off, if you've enjoyed this episode, please take a moment to share it, tweet it, rate it, review it, subscribe to it, and thumbs up it, if I can use thumbs up as a verb, which I just did.
00:47:22
Speaker
Because all that stuff is the feedback loop that helps me feed forward into future episodes, so it really helps. But until then, I've been your host, Chris Jenkins. This has been Developer Voices with Adi Polak. Thanks for listening.