Episode #68: Randy Olson

The PolicyViz Podcast

Welcome to the final episode of The PolicyViz Podcast of 2016! Across 39 episodes this year, we’ve covered open data, data visualization, data workflow issues, presentation skills and design, and more! Thank you for tuning in each week! Your support...

The post Episode #68: Randy Olson appeared first on PolicyViz.

Transcript

Introduction to Randy Olson

00:00:11
Speaker
Welcome back to the PolicyViz Podcast. I'm your host, Jon Schwabish. I hope you're out there having a great holiday season, and thanks for tuning back in, because on this week's show I am very excited to have Randy Olson, data visualization and machine learning expert and tweeter extraordinaire. Randy, welcome to the show.

Randy's Background in Data Science

00:00:31
Speaker
Hey, I'm excited to be here. Thanks for having me. Yeah, thanks for coming on. How is your holiday season starting off?
00:00:37
Speaker
Going pretty well. Just taking it easy this week. It's really, really cold here in Philadelphia. Sorry if I get the shivers while I'm talking. It's even cold in this room. Finally getting a taste of winter. Soon we'll be complaining that it's too cold, and then we'll be complaining that it's too hot. That's just the way it is, I guess. The circle of life. The circle of life, that's right.
00:01:01
Speaker
We've got a few big things I want to dig into because you have a really cool project on National Parks, and then do a lot of interesting research on machine learning, and I want to talk about both of those. But before I dive into

Journey into Data Visualization and Blogging

00:01:11
Speaker
some specific questions, can you maybe introduce yourself for folks and give a little bit of your background? Yeah, sure. I can give a brief synopsis, I guess, if I may. So yeah, I think data viz and machine learning guru is a good way to describe myself. Tweeter extraordinaire, sure.
00:01:30
Speaker
You'll take it, right? Yeah, I've started running out of time for that lately, but I still try to keep on top of it. But yeah, so basically, formally, I'm a senior data scientist at the University of Pennsylvania Institute for Biomedical Informatics.
00:01:44
Speaker
And that's basically a fancy title for a really cool job that I have here, where I basically get to spend all of my time developing machine learning software, developing machine learning methods, and applying them to biomedical data at the University of Pennsylvania. So I'm basically just, you know, geeking out 24-7 or close to it over data and machine learning and data viz, you know, all the stuff I ever dreamed of.
00:02:12
Speaker
When I was in grad school, by the way, I was a PhD student at Michigan State University. I will still say go green, even though this has been a really terrible season for us in football. I'm still a proud Spartan. Yeah, I studied computer science there, and studied computer science in undergrad. So basically a decade straight of computer science education building up to the ultimate nerdy job now. Yeah, also on the side, when I was in grad school, I started a hobby
00:02:42
Speaker
of blogging about data and data visualization and, eventually, machine learning, and that led to, I presume, some of the projects we'll talk about here. Yeah, absolutely. And I'm also interested in talking about how your work in the computer science lab overlaps with the health providers, I guess, as it were, at the rest of the university.

Optimal Road Trip Project

00:03:03
Speaker
Before we dive into that, I want to talk about one of your latest projects, I guess, would be your optimal road trip to the US national parks project, which was super fun. And I've been exploring it to see how I can take the kids on a fun national parks tour the fastest way possible, because I'm not sure how many hours in the car I can handle with both kids. But do you want to talk a little bit about that? And what was the inspiration and how you developed it?
00:03:30
Speaker
Yeah, sure. I mean, basically, you know, I mean, it's the centennial for the National Park Service. And so, you know, I wanted to do something to help celebrate that, to help encourage people to, you know, get out there and even visit, you know, even just a few of the parks and see, you know, the really
00:03:50
Speaker
beautiful parks that we've organized over the years. And so somebody gave me this idea. You know, I've become a little bit famous or infamous or whatever you want to call it for designing these optimal road trips that can hit a whole bunch of locations in the fastest time possible. And someone suggested, oh, why don't you try to do that for all the national parks? So that's basically what this trip is: it tries to hit every single national park in the US
00:04:17
Speaker
as quickly as possible, meaning as little driving as possible. And it ends up being this epic multi-day trip that I think would at least take you a couple months to finish. But wow, it would be an amazing trip. I wish I had the time to go on it. And how did you develop the algorithm to, I guess, minimize the driving time and maximize the adventure?
00:04:40
Speaker
Yeah, so that started, gosh, has it been almost two years now? Wow, time flies. It has a bit of a funny beginning because I originally developed that algorithm when I was writing up my dissertation for my PhD. And as with all PhD students, I was looking for a distraction from writing my dissertation. And so one weekend when it was really snowy in Michigan, which tends to happen a lot during the winter,
00:05:07
Speaker
I was snowed in for the weekend. My weekend plans were canceled, so I turned to what else but my data hobby, my data blogging hobby. And I found this really cool article on Slate.com that talked about finding Waldo from the Where's Waldo children's books as quickly as possible using math.
00:05:30
Speaker
And they came up with a pretty good technique for doing that and I was pretty impressed. But at the same time, I had been studying and applying machine learning for several years at that point and I said, you know what? I bet I can reach into my bag of machine learning tricks and do better than that.
00:05:46
Speaker
So that's exactly what I did that weekend. I basically treated finding Waldo as a traveling salesman problem, trying to sort of optimize that path on finding him as quickly as possible. And so using this machine learning algorithm, I was able to do that and effectively solve Waldo. I was able to discover a path across the Where's Waldo pages. I remember it was an average of 10 seconds or so.
00:06:12
Speaker
So that was a pretty cool result because, I mean, you know, I basically solved an age-old problem that at least I grew up with of finding Waldo. He always, you know, he always hid from me. And now, you know, when I have kids, I can impress them and be like, look at how cool dad is. He can find Waldo in like 10 seconds. You know, dad is amazing. And so I shared that as a blog post.
00:06:39
Speaker
And then I was contacted by a reporter at Discovery News, Tracy Staedter, the next weekend. And she said, you know, hey, this problem that you're solving here, this traveling salesman problem where you're trying to hit, you know, every point on the page as quickly as possible, this is very similar to what we do when we design road trips.
00:07:00
Speaker
I mean, when some of us design road trips, you know, we choose where we want to go. And then, unless we really, really love driving, we try to find the shortest path possible to tick off all those places.
00:07:13
Speaker
I said, yeah, that sounds exactly like the same kind of problem. And so I spent the next weekend, even though I wasn't snowed in that weekend, working on that. And I came up with this algorithm to optimize road trips using Google Maps directions.
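
[Editor's sketch] The road trip described here is a traveling salesman problem: visit every stop once and return home with as little driving as possible. The block below is a minimal illustration of that framing, not the algorithm discussed in the episode. It assumes straight-line (great-circle) distances instead of Google Maps driving directions, uses rounded, approximate coordinates for six parks, and simply tries every ordering, which only works for a handful of stops. The names `STOPS`, `haversine_km`, `tour_length`, and `best_tour` are made up for this example.

```python
import math
from itertools import permutations

# Rough (latitude, longitude) pairs, rounded for illustration only.
STOPS = {
    "Acadia": (44.35, -68.21),
    "Great Smoky Mountains": (35.61, -83.49),
    "Yellowstone": (44.60, -110.50),
    "Grand Canyon": (36.10, -112.10),
    "Zion": (37.30, -113.00),
    "Yosemite": (37.87, -119.54),
}


def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def tour_length(order):
    """Length of the closed loop that visits the stops in the given order."""
    return sum(
        haversine_km(STOPS[order[i]], STOPS[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )


def best_tour(start):
    """Try every ordering of the remaining stops and keep the shortest loop."""
    others = [name for name in STOPS if name != start]
    return min(
        ([start, *perm] for perm in permutations(others)),
        key=tour_length,
    )


route = best_tour("Acadia")
print(" -> ".join(route), f"({tour_length(route):.0f} km)")
```

Exhaustive search is exact, but the number of orderings grows factorially with the number of stops, so a trip through dozens of parks needs a heuristic search rather than this brute-force loop.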
00:07:31
Speaker
Right. What I love about the post is your walkthrough and the image, of course, the Google Maps image, but the comments are probably the best part. People debating which parks are worth including or excluding, how much it would cost to do all the parks. Now, how many of these parks have you actually visited?
00:07:50
Speaker
Unfortunately, I've only visited a few myself. But I keep designing these road trips, these epic road trips, and being like, I'm going to do it one day. I'm going to do it one day. But it is very difficult to get. Most of these road trips require at least a couple of months to do them in full. And so a follow-up post to that original one actually focused on how can you optimize
00:08:15
Speaker
you know, subsets of these trips. You know, like, let's say you have 48 hours, you have a weekend, you know, and you want to do the biggest road trip possible out of that, you can do that as well. Right. Yeah, so I haven't gotten to do much traveling myself, but I've already been contacted by a couple of people who have done at least a large portion of this national parks road trip.
00:08:37
Speaker
And for the original road trip that was hitting a major landmark in every state, at least a dozen people have done that now. It's really amazing. People see these maps, these trips on the internet, and they say, I'm going to do that. Those people are really inspiring to me. Now, what would it take for you to modify this to do, say, the optimal road trip for... You've seen a lot of these optimal road trips for going to every baseball stadium during the season or every football stadium.
00:09:06
Speaker
How would you think about doing that when you're adding an element where you can only be in Seattle or in San Francisco at a particular time? That's a very, very similar problem except basically adding time constraints.
00:09:20
Speaker
So there's actually, funny enough, another article on Slate that does exactly that. Using a different method, but they're able to find an optimized road trip under these time constraints as well. But it's certainly possible. I mean, all these algorithms in some sense are trying to optimize some criteria. For my algorithm, it's minimizing the number of miles that you drive.
00:09:47
Speaker
And you can sort of make that a multi-objective problem where you say, you know, minimize the number of miles that we drive, but also take these time constraints into account as well. You know, we can't go back in time to go to a game that happened last week or something. So, yeah, that's a big field of research, but it certainly complicates the problem because you can't just, you know, visit any place at any time of the year anymore.
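
[Editor's sketch] One simple way to picture the stadium variant is to keep "minimize driving" as the objective and treat game dates as hard constraints: an ordering only counts if you can reach every stadium on or before one of its game days. The block below is a toy version under stated assumptions, not the Slate method or the approach from the episode. The four stops, their grid positions, and their game days are hypothetical, distance is measured in whole days of driving, and the trip is assumed to start at the first stop's earliest game.

```python
from itertools import permutations

# Hypothetical stadiums: grid position (one unit = one day of driving)
# and the days, counted from day 0, on which a home game is played.
GAMES = {
    "A": {"pos": (0, 0), "days": [0, 6]},
    "B": {"pos": (1, 0), "days": [2, 5]},
    "C": {"pos": (2, 1), "days": [3, 8]},
    "D": {"pos": (0, 2), "days": [4, 9]},
}


def drive_days(a, b):
    """Manhattan distance between two stops, read as days of driving."""
    (x1, y1), (x2, y2) = GAMES[a]["pos"], GAMES[b]["pos"]
    return abs(x1 - x2) + abs(y1 - y2)


def evaluate(order):
    """Return total driving days, or None if some stadium has no game left."""
    today = min(GAMES[order[0]]["days"])  # start at the first stop's first game
    driven = 0
    for prev, nxt in zip(order, order[1:]):
        leg = drive_days(prev, nxt)
        driven += leg
        upcoming = [d for d in GAMES[nxt]["days"] if d >= today + leg]
        if not upcoming:
            return None  # we would arrive after the last game there
        today = min(upcoming)  # wait in town until the next game
    return driven


scored = [(evaluate(order), order) for order in permutations(GAMES)]
feasible = [(cost, order) for cost, order in scored if cost is not None]
best_cost, best_order = min(feasible)
print(best_order, best_cost)
```

A softer, more "multi-objective" version would replace the hard `return None` with a penalty added to the driving total, which lets a search trade a missed game against a much shorter route.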
00:10:14
Speaker
So let's turn to some more of your full-time research, not just the fun stuff, although I guess it sounds like your full-time research is the fun stuff as well. Can you talk a little bit about the machine learning types of work that you do? And you had mentioned that it's biomedical stuff, so how much does it overlap or how much do you interact with, I should say, with the folks who are actually in the Penn Hospital and using some of the tools and techniques that you are developing?

Machine Learning in Healthcare

00:10:42
Speaker
Yeah, sure. Yeah. So basically, the, you know, the road trip stuff was one kind of machine learning algorithm and optimization algorithm that I like to use. In my day to day work, I use more standard statistical machine learning methods. You know, folks who are familiar with machine learning, things like random forests or logistic regression, all that fun stuff.
00:11:04
Speaker
Yeah, so in the work that I do, I actually work a fair bit with Penn physicians. As a part of this institute that I'm a part of, we have this whole room called the Idea Factory. Basically, the whole point of this room is for Penn physicians and Penn researchers to come in with their datasets or with issues, and we have a whole bunch of data scientists, including myself, there to bring that technical expertise to their research.
00:11:34
Speaker
So they obviously have way, way more experience than us working with the actual patients and working with the data. But in terms of actually applying machine learning to it, that's where we come in. We help them design the experiments, how to properly sample the data, doing the initial exploratory data analysis,
00:11:57
Speaker
doing some basic modeling, all that stuff. So we work quite a bit with the Penn data. But as far as collecting the data, I have to say, for me at least, thankfully, I don't have to work much with the patients.
00:12:12
Speaker
There's definitely a big focus on actually building tools that are actually useful for the hospital. Building a model that can predict the need for beds in the emergency room, for example. A lot of the things
00:12:28
Speaker
not just at Penn, but at the hospital systems all around the country, they sort of just use rules of thumb or heuristics about guessing when a patient needs extra care or when we have to adjust something about the hospital. So now we're trying to use machine learning, trying to use a data-driven approach to improve those decision-making processes. So it's pretty cool because my PhD research focused on a lot of theoretical stuff.
00:12:56
Speaker
But now I actually get to build machine learning applications that have direct uses for the Penn hospital and also improve people's health care at Penn.
00:13:08
Speaker
It's been a really nice position so far. Yeah, very applied. So are you pulling in then real time data? You have a patient who's in the ER and you're trying to predict, you know, something health wise, you know, moving up to the ICU, moving out of the hospital. So are you pulling in real time biometric data and running that through an algorithm and constantly updating it?
00:13:30
Speaker
No, so far we have not been working on real-time stuff in our institute. However, there is a really good data science team working directly within the hospital as well. They are doing that: they built this Penn Signals application that they're integrating directly into the hospital system
00:13:49
Speaker
that processes information in real time, makes predictions about the patient's health, and also even just does basic things like providing the number of tests that a patient has had.
00:14:04
Speaker
and stuff like that to the history. Things that normally are just a stack of paper that the physician has to look through, now they have a nice computer interface for that. But as far as our work, most of the time we gather the data ahead of time, and we even design an experiment and gather the data in a specific way, and then we start modeling around that. So we try to account for things like, most diseases tend to be fairly rare in the population, so we have to
00:14:33
Speaker
you know, account for class imbalances in the data and things like that. So we usually don't work with real-time data, at least not yet, but that doesn't mean we won't in the near future.
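
[Editor's sketch] The class imbalance mentioned here is easy to see with a toy example. When a condition is rare, a model that never flags anyone still scores high on plain accuracy, so the rare class is usually given more weight. The block below is an illustration with synthetic labels (20 positives out of 1,000), not data or code from the institute. The weighting formula, n_samples / (n_classes * class_count), is one common "balanced" heuristic, and the function names are made up for this example.

```python
from collections import Counter

# Synthetic labels: 1 = has the (rare) condition, 0 = does not.
labels = [1] * 20 + [0] * 980


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    return {cls: len(y) / (len(counts) * n) for cls, n in counts.items()}


def accuracy(y_true, y_pred, weights=None):
    """Plain accuracy, or class-weighted accuracy when weights are given."""
    weights = weights or {cls: 1.0 for cls in set(y_true)}
    correct = sum(weights[t] for t, p in zip(y_true, y_pred) if t == p)
    total = sum(weights[t] for t in y_true)
    return correct / total


always_healthy = [0] * len(labels)  # a useless model that never flags anyone
weights = balanced_class_weights(labels)

print(weights)
print(accuracy(labels, always_healthy))
print(accuracy(labels, always_healthy, weights))
```

With these labels, the never-flag model gets a plain accuracy of 0.98 but a weighted accuracy of about 0.5, which is a more honest picture of a model that misses every patient with the condition.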

Communicating Uncertainty to the Public

00:14:45
Speaker
Right. You know, one of the things we've seen recently with the election and the discussion after the election, at least with regards to polling, is this
00:14:54
Speaker
idea of uncertainty and how we convey uncertainty and whether people understand uncertainty. Do you find that that is an issue you need to tackle when you are trying to build these algorithms and communicate patient information to doctors where there's an X percent chance that this patient could do this or need this and yet there's uncertainty around that?
00:15:16
Speaker
Well, I mean, so as far as communicating that information to the physicians, thankfully, that's really easy. That's sort of a world that they live in, it seems, which has been a relief. It's very standard terminology to say that someone is at risk or not at risk.
00:15:33
Speaker
that it's implied that there's a probability there. Just because you're at risk doesn't mean you're going to develop this disease, right? And then there's a gray area there where you can have certain levels of being at risk and so on and so forth. So as far as communicating that information to the physicians and researchers, it's been pretty easy.
00:15:53
Speaker
But I mean, obviously communicating that to the general populace is much, much harder. You know, one of the things that I tweeted about the election is, you know, gosh, what was the probability of Trump winning the day prior to the election, like 30%? Yeah, something like, I think it was like 15, 20% or something like that, according to the Upshot, I believe.
00:16:15
Speaker
Yeah, so I mean, yeah, there are numbers all over the place, but all of them were fairly low. And my, you know, my tweet was pointing out that, you know, just because it says 15%, 20%, it doesn't mean it's impossible, right? You know, it's not out of the window. And so we had a first-hand lesson during the election there that rare events can happen. Yeah. So yeah, I think it's still a tremendous challenge,
00:16:43
Speaker
communicating the concept of probability to the general population. I won't sit on a high horse here and pretend that I don't fall susceptible to it too. It takes a certain mindset to think probabilistically, to think when someone says you're at risk of some disease, you're not doomed.
00:17:08
Speaker
And personally, I've still fallen, you know, prey to that. Well, I mean, every day when we look at the weather forecast, if I see 10% chance of rain, 20% chance of rain, I say, yeah, it's not going to rain. And then I get pissed off when it does rain and I didn't bring my umbrella.
00:17:24
Speaker
You know, so we live in a probabilistic world, and so it takes practice to think from... Yeah, absolutely. Are you also thinking about hardware when you're developing the algorithms, or is that someone else's job? So I'm thinking of in healthcare,
00:17:41
Speaker
there's obviously been a lot of change in the software and the technology that they're using, but they're also moving, I would guess from my fortunately brief interactions with the healthcare field, from stationary desktop computers to laptops to mobile tablet devices. So are you thinking about how people would use your dashboards or your visualizations on those sorts of different platforms?
00:18:09
Speaker
Oh yeah. So a couple things to say there. So yeah, absolutely. We're considering hardware because we can't always expect people to have really powerful machines right there when they're dealing with the patients. Sometimes they're just using tablets or whatever else. So in this Idea Factory, we focus largely on web-based interfaces
00:18:31
Speaker
that people can log into and interact with and, you know, make it very, very swipey and very interactive. So absolutely, that's been a major consideration. And another thing that's been particularly exciting here is that we have a large fund for building an actual computing center behind
00:18:53
Speaker
this whole visualization room. And so one of the greatest things that we've had is the ability to build an AI behind everything that's going on here. So sort of an AI assistant, almost on the level of, well, if you've seen the movie Idiocracy, you know, they have the sort of AI doctor, you know, you plug in something to like your mouth and something in your butt, and then it gives you... it's not on that level. But you know, that's, that's,
00:19:22
Speaker
I won't say that's what we're building towards, but think of something like that. An AI assistant that, when people bring in their data sets or whatever else, can help them along the way, applying the right machine learning methods or whatever else. In addition to considering the fact that people are going to be using our tools on a tablet or a phone, which has decent computing power but not the computing power that we need for machine learning,
00:19:51
Speaker
We're also dealing with the fact that we do have algorithms that can use a lot of computing power, and we have that as well on the back end. So it's quite nice building a very lightweight front end that works on all kinds of hardware, but also connects into a much more powerful computing system in the back end.
00:20:11
Speaker
Really interesting. Before we finish up, I wanted to ask you where you think machine learning and the sort of work that you're doing is going to be in the next year or five years.

Future of Automated Machine Learning

00:20:24
Speaker
Yeah, sure. One thing that's actually directly relevant to my research, I think, is there's been this huge growth in this idea of automated machine learning. Basically, automated machine learning is this idea that
00:20:40
Speaker
machine learning is quite powerful, right, in that it allows computers and machines to learn from data and make predictions from that data. But it's actually still pretty laborious to actually design those machine learning algorithms, to code them up, to select the right parameters, to process the data in the right way.
00:21:01
Speaker
So we've put a tremendous effort in our institute here into this idea of automated machine learning where we have sort of a meta algorithm on top of this entire machine learning and also the full data science workflow.
00:21:16
Speaker
to automate that entire process of processing the data all the way to modeling it and delivering that final model to production. So I think there's going to be a tremendous focus, especially coming up over the next couple of years, improving upon these automated machine learning algorithms because they're going to relieve
00:21:35
Speaker
machine learning researchers and machine learning practitioners from the sort of boring process of doing the really basic stuff of selecting the right model, optimizing its parameters, yada yada yada, and freeing them to work on the really harder problem of machine learning, which is thinking in the right terms. You know, how do we pose this as a data problem, how do we pose this as a machine learning problem, and how do we do
00:22:00
Speaker
it in a way that's actually useful for my company or for my institute or whatever else. So I think that's going to be a tremendous focus, and we've already seen, even just this year, some really nice progress in this field where we have these automated machine learning systems actually beating hundreds of machine learning experts on, let's say, Kaggle problems or other machine learning competition problems. So it's really, really promising, really, really exciting.
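
[Editor's sketch] The "meta algorithm on top of the workflow" idea can be pictured as a search over whole pipelines, where each candidate bundles a preprocessing choice with a model setting and is scored on held-out data. The block below is a deliberately tiny illustration and not any particular automated machine learning system: the data is synthetic, the search space is two scalers times three values of k for a hand-written nearest-neighbor classifier, and the search is exhaustive rather than anything cleverer. All names here are hypothetical.

```python
import random
from itertools import product

random.seed(0)

# Synthetic data: feature 0 carries the signal, feature 1 is large-scale noise.
DATA = [
    ([random.gauss(i % 2, 0.5), random.gauss(0, 50)], i % 2)
    for i in range(60)
]


def fit_identity(rows):
    """Preprocessing option 1: leave features untouched."""
    return lambda row: row


def fit_min_max(rows):
    """Preprocessing option 2: rescale features using the training rows."""
    cols = list(zip(*rows))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return lambda row: [
        (v - l) / ((h - l) or 1.0) for v, l, h in zip(row, lo, hi)
    ]


def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest training points."""
    nearest = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )[:k]
    return 1 if 2 * sum(train_y[i] for i in nearest) > k else 0


def cv_accuracy(fit_scaler, k, folds=5):
    """Held-out accuracy of one (scaler, k) pipeline over all folds."""
    hits = 0
    for f in range(folds):
        test_idx = set(range(f, len(DATA), folds))
        train = [DATA[i] for i in range(len(DATA)) if i not in test_idx]
        scale = fit_scaler([x for x, _ in train])  # fit on training fold only
        tx, ty = [scale(x) for x, _ in train], [y for _, y in train]
        hits += sum(
            knn_predict(tx, ty, scale(DATA[i][0]), k) == DATA[i][1]
            for i in test_idx
        )
    return hits / len(DATA)


search_space = list(product([fit_identity, fit_min_max], [1, 3, 5]))
best_scaler, best_k = max(search_space, key=lambda c: cv_accuracy(*c))
print(best_scaler.__name__, best_k, cv_accuracy(best_scaler, best_k))
```

Real systems search far larger spaces and use smarter strategies than trying every combination, but the loop has the same shape: propose a pipeline, score it on held-out data, keep the best.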
00:22:29
Speaker
And so just Google up automated machine learning if you want to learn more about that, I suppose. Yeah, that sounds fascinating. And it sounds exciting. I mean, it sounds exciting for what we're able to see over the next few years. So that's great. Randy, thanks so much for coming on the show. This has been a lot of fun.
00:22:45
Speaker
Yeah, it's been great chatting with you. And thanks, everyone, for listening. If you want to take your optimal road trip to the US national parks, you should go check out Randy's site at randalolson.com. And I'll put a link to that post and the rest of his website in the show notes. So until next time, this has been the PolicyViz Podcast. Thanks so much for listening.