Max Kuhn Shows You How to Model Data in R

S9 E227 · The PolicyViz Podcast

Max Kuhn is a software engineer at RStudio. He is currently working on improving R’s modeling capabilities and maintains about 30 packages, including caret. He was a Senior Director of Nonclinical Statistics at Pfizer Global R&D in Connecticut and applied models in the pharmaceutical and diagnostic industries for over 18 years. Max has a Ph.D. in Biostatistics. He and Kjell Johnson wrote the book Applied Predictive Modeling, which won the Ziegel Award from the American Statistical Association, recognizing the best book reviewed in Technometrics in 2015. Their second book, Feature Engineering and Selection, was published in 2019, and his book with Julia Silge, Tidy Modeling with R, was published in 2022.

Episode Notes

Website at RStudio: https://www.rstudio.com/authors/max-kuhn/
Twitter: https://twitter.com/topepos
Github: https://github.com/topepo 

R Packages:
autoML
caret
Quarto
RMarkdown
tidymodels
tidyverse

Books from Max:
Tidy Modeling with R: A Framework for Modeling in the Tidyverse
Applied Predictive Modeling
Feature Engineering and Selection

R for Data Science: Import, Tidy, Transform, Visualize, and Model Data by Garrett Grolemund and Hadley Wickham

Related Episodes

Episode #225: Julia Silge
Episode #212: Dr. Cedric Scherer
Episode #210: Dr. Tyler Morgan-Wall
Episode #207: Tom Mock
Episode #150: Learning R
Episode #69: Hadley Wickham

iTunes

Transcript

Introduction with Max Kuhn

00:00:12
Speaker
Welcome back to the PolicyViz Podcast. I'm your host, Jon Schwabish. On this week's episode of the show, I welcome Max Kuhn. Max is the author of the new book, Tidy Modeling with R, with Julia Silge. Now, if you're a listener of the show, you may have listened to that episode with Julia just a few weeks ago.
00:00:29
Speaker
Julia really didn't want to talk about her new book, which is amazing. She wanted to just move on to the next thing that she's working on at RStudio, which is great. But I want to talk more about tidy models, so I reached out to Max to see if he would like to talk about the book. So we do talk about the book and we talk about tidy models and we talk about a lot of the other fascinating things that he's working on at RStudio. We also talk about his background in the pharmaceutical area and how he moved into R and into RStudio.
00:00:55
Speaker
There's a lot to learn here. There are a lot of great links in the show notes that I've included if you want to explore some of these different packages and explore Max's GitHub page, where he's got a lot of information, a lot of code, a lot of resources for you. It's a really interesting episode, a really interesting discussion, and if you are an R programmer, this is the episode for you.

About the Sponsor: Partner Hero

00:01:16
Speaker
Now, before I turn you over to that interview, I'm excited to let you know that I have a sponsor for the podcast. That's right. Partner Hero is sponsoring this and several more podcasts coming up. So let me tell you a little bit about them.
00:01:33
Speaker
before I let you go on to the next thing because I think if you are a freelancer, if you're a small business owner, if you're working in the data viz field, Partner Hero might have the right solution for you. Partner Hero is a customer service outsourcing firm. They have flexible terms. They'll help you scale quickly. They have quality assurance
00:01:53
Speaker
programs baked right into the tool, and they have offices all around the world, which of course is really important. Now I've used lots of different outsourcing tools and platforms to do some of my work. You know, maybe I need someone to help me scrape something off the web. Maybe I need someone to help me write something or clean something up. And they're all fine. They have some pros and cons. Some of the user interfaces aren't great. This and that. The thing that I like most about Partner Hero
00:02:17
Speaker
is that they are really emphasizing values and they are values aligned and trying to change the outsourcing industry because there is so much work out there where people are being exploited, people are being taken advantage of, and Partner Hero is really focusing on that.
00:02:33
Speaker
So if you are a small business owner or if you're a freelancer, generally, if you're just ready to bring in outside customer support help for your startup that feels like it's part of your existing team, I recommend you check out Partner Hero. So head on over to partnerhero.com slash policyviz to book a free consultation with their solutions team. Mention you heard about Partner Hero from policyviz and they'll waive the setup fee. So that's partnerhero.com slash policyviz to check out Partner Hero's
00:03:03
Speaker
outsourcing firm. So with no further ado, let's check out the interview with Max Kuhn from RStudio.

Max Kuhn's Background in Biostatistics

00:03:10
Speaker
I hope you'll enjoy the conversation. Hey Max, good afternoon. How are you? I'm well. How are you doing? I'm good. You've got a friend behind you. Yeah, that is my new puppy Kaladin. He's six months old. I've had him for like two weeks and he's like the platonic ideal of a good boy, you know.
00:03:31
Speaker
He's literally Max's best friend. He's just chewing on a, what is it? One of those rubber things with the peanut butter in it. Yep. Yep. Good. Enjoying it. Um, well, thanks for coming on the show. You have this new book out that I'm excited about, which I am just checking out now and there's a lot in there. So I want to get to that, but I wanted to start with your background because like a lot of folks who come on the show working in data and data visualization.
00:03:58
Speaker
It's not like a direct line to where they are now. And you have an interesting background. So I was curious if you could just talk about yourself so folks know a little bit more about you. Where you started and how you ended up at RStudio? Yeah, so I'm a biostatistician. So I have a PhD in that. I worked for about six years in Baltimore for a company doing molecular diagnostics. That was a lot of like traditional nonclinical statistical work like
00:04:25
Speaker
designing and analyzing experiments and then doing some algorithm work for the instruments.

Development of the caret Package

00:04:30
Speaker
And then I went into drug discovery for about 12 years, again, doing like the early research sciencey stuff and did a lot of modeling and experimentation and things like that. I really enjoyed it. So yeah, when I left there, it was like a little bit of push, a little bit of pull. I honestly, I was not
00:04:49
Speaker
thrilled with working for a huge corporation, maybe with goals not in alignment with mine. The scientists I worked with and the work I did, I loved, and I really, really enjoyed all that. But, you know, it was kind of like, I think I'd reached my period of like, yeah, I need a new job. And so there was a little bit like, yeah, I should look for something else. But a big part of it was when J.J. Allaire and Hadley Wickham were like, hey, we're going to be doing some more stuff with R.
00:05:18
Speaker
Would you like to work on modeling software? And I was like, yeah. So in the first week or two of working in discovery, I had had some ideas on an R package. So this is like 2005. And there weren't many. R was a very heterogeneous environment. It still is for modeling. And so I wrote this package called caret, which was mostly used internally for a year or two for computational biology and chemistry.
00:05:47
Speaker
But it was like something that would unify a lot of the disparate interfaces and things like that. And it also had a lot of functions that didn't really exist, like support functions if you wanted to calculate sensitivity and specificity. And with that, I was working on a book on modeling. And that was going pretty well. It was like a little pet project that kind of blew up on me. And I was like, holy smokes, now people are using it. And then the company was like, yeah, yeah, I know we allocated some time for you to do that.
00:06:17
Speaker
You're kind of on your own now for doing that. So with caret, I love it. I'm happy so many people have used it and have gotten a lot out of it, but it maybe wasn't designed very well as time went by. It was like something I was doing in my spare time.
00:06:34
Speaker
So it didn't really have much potential. It's got a lot of stuff in it, but it wasn't very extensible. And so, you know, when they say like, Hey, what would you do if you could spend all day writing modeling software? And I was like, well, I'd kind of start over. Um, there's a lot we learned about interfaces to models since then, um, especially with Hadley and all this stuff he's done with the tidyverse and ggplot2, things like that. So yeah, so there was a very enticing offer to be paid full time to write
00:07:03
Speaker
data analysis software and start over and work with a lot of really good people who know a lot of really good stuff and let things evolve. So did you go to RStudio in 2005? I think it was 2016. No, 2005 was when I started in drug discovery. Oh, okay. Okay. So you've been at RStudio now for about
00:07:24
Speaker
seven, eight years, something like that. It was like the end of 2016. Okay.

Conceptualization of Tidy Models

00:07:30
Speaker
Yeah. I mean, that's really interesting. So, so then, so where does tidy models come from in your, your origin story?
00:07:37
Speaker
as it were. I thought about different components of what I would want to do next. It happened, really weirdly, that I was in New York the same day that they were in New York. And they knew I was in Connecticut, so they were like, hey, could you come to New York? And I'm like, what street are you on? And so we actually sat down at a whiteboard
00:07:57
Speaker
and outlined some stuff. And I had originally been thinking about what became the recipes package, which I'm really particularly proud of. Recipes is sort of like a combination of dplyr and R's formula method. And what that means, if you're not familiar with those, is you could very quickly have a very expressive, sequential way of preprocessing your data prior to modeling or doing feature engineering or feature extraction,
00:08:25
Speaker
and allow you to do things that you couldn't necessarily do with R's traditional modeling tools. And so, you know, and that's very, very much influenced by the, you know, pipeable functions and dplyr and things like that. So, you know, we kind of thought about that. And that was sort of like the first little bit of it. You know, I'd read R for Data Science and some things that Hadley was doing in there that didn't really persist beyond that, but the way that he stored like resampling information, stuff like that,
00:08:53
Speaker
um, eventually became rsample, sort of like a beefed up version of that. And so these little pieces sort of came together. Um, you know, what do we do about having a better interface to models? And there were things I definitely learned with caret. Like, yeah, I'm not going to do that again. It became a little bit more complicated way of doing things behind the scenes, but in the end, I think it's a lot simpler for, um,
00:09:18
Speaker
Well, I think at least it's a lot simpler for people to use. So it was just sort of like, yeah, you know, and I remember the first time, like right when I started was like the first RStudio conference and somebody asked Hadley on stage, like what's Max going to be working on? And he said like modeling and they're like, Oh, what kind? He's like all of it. And I'm literally sitting, Jenny Bryan and I had just met because we kind of started the same place. We're sitting next to each other and I was like,
00:09:42
Speaker
I don't know how clean this podcast is, but, but, you know, it's very open-ended and thankfully it still is. And, you know, and they trusted me enough to say, like, all right, well, let's, you know, give him a hand when he needs it. Like we, we usually talk a lot about interfaces and things. So like, what's a good way of doing things. Yeah. I want to ask what drew you to R in the first place, as opposed to, I don't know, any of the biostats packages. I'm sure there's a ton of them, but like, what, what drew you to R?

Switching from S+ and SAS to R

00:10:09
Speaker
Well, I'm going to date myself. I was in graduate school in the 90s, so basically your two choices there for statistical analysis were SAS and S+. There was no R, and SAS was what people were being taught. It became very clear to me that, you know,
00:10:27
Speaker
you're very limited in what you can do. And the reason I went into nonclinical statistics was the nature of that kind of job is you have like hundreds of customers doing many different things that don't have any predefined analysis. So somebody comes to you with some new laboratory tests that they're working on that produces some really funky type of output and it's your job to translate that into some numbers. And so, and I always gravitated to problems like that. So I felt having something very like, um,
00:10:57
Speaker
It's going to sound bad. I don't mean it to be bad, but like a superficial sort of programming language like SAS, it's not very expressive. And so, and so in graduate school, I saw S+ and I was like, all right, you know, this is nice. Right. And so eventually S+ sort of petered out when R came online. And so, you know, R for me, I mean, it sounds like a silly thing to say now because we're so
00:11:20
Speaker
Uh, used to having actual programming languages to work with, but in the mid nineties, you know, that wasn't the case. So, you know, having something where you had scoping and functions and data types and, you know, and, and that was like, uh, it was like, yeah, I can do anything I want to here. Um, so, I mean, so, you know, I think in the, in the language wars in data science, I think most of it's based on where you started and it's like, I've never used Python or anything like that, but yeah.
00:11:46
Speaker
You know, R, I really do believe that R clearly abides, but like the S language in R is really based for people whose fundamental thinking process is about interactive data analysis, right? So it's built from literally from the ground up with that in mind, which is not a knock against any other programming language. But if you're asking me like what I feel makes it click with me, it's like, oh, you know, there's some things that could look really kludgy to outside people
00:12:14
Speaker
but they're really nice in terms of the context of what you're doing. So contextually, you can call it like a DSL or domain specific language or something like that.
00:12:25
Speaker
But I like it for what it does. I wouldn't use it to like do my taxes, but no, no, right. For data analysis, it's right. Yeah. For the tool. Yeah. And I guess I don't know enough about the history of R, to be perfectly honest, but the movement from base R into using a GUI like RStudio. I don't even know what my question is here, I guess, but like, how did that make maybe the question is how did that change your.
00:12:51
Speaker
Maybe your workflow or the way, or maybe it didn't, or how you think about using the tool when it has more of this space. That's a little bit. Well, I don't know. That's probably true, right? Just use more user friendly.
00:13:04
Speaker
So I think there's like two aspects of that that are worth talking about. And first is syntax. And, you know, people are still arguing about this. You know, being like an S person, you know, back in the day, when I first started using it, I remember sitting in my office and this is in Richmond, Virginia, in the basement of the medical building and saying like, where's the, where's the damn inverse function? I just need to invert this matrix, right? It's solve. It's, you know, it's the solve function without adding the extra argument. Yeah.
00:13:33
Speaker
And I'm like, what the hell? Or like the sort function. I want to sort a data frame. And no, you can't really do that. You have to use subscripting with the order function. And so it's really efficient and does some good things. I mean, it does a lot of good things. But it's not really written in terms of like, I could figure that out. I have a PhD in statistics and done all this stuff. But I've worked with a lot of bench scientists, and I've worked with a lot of people who don't have any training in data analysis or statistics or computer science.
00:14:01
Speaker
Right. And I think the things like the tidyverse are born out of that. Like there's some just low hanging fruit about let's name things better. Let's give them arguments that, you know, and then you consider like the pipe, you know, the magrittr pipe or now the base R pipe and start designing code for that. You know, people are still arguing about this, but I feel like for the average user, it's far and away better to be using, you know,
00:14:26
Speaker
Assuming you're inside the scope of what, let's say, the tidyverse or associated things do, it's a much nicer place to live.

RStudio IDE and Tooling Significance

00:14:34
Speaker
So on the syntax side, I think that's a good answer. I think the sort of, you know, tidyverse gets a lot of press, but I think the thing for me, as a developer,
00:14:43
Speaker
And this translates to the average user too is the tool sets are so good. They're so good. Like the RStudio editor, like the IDE, the tooling around almost everything is, it just can't be better. And so the thing I've learned over the years is, you know, it's not that you need really, really good tools to be great at something.
00:15:05
Speaker
But man, does it make your life so much better. Like I can't imagine doing package management without the tools that the people inside of RStudio build. Think about users, like in terms of like importing data. The tools are largely, I think a lot of the tools I'm talking about are really born out of RStudio. Here's a great example: Jenny Bryan has spent a lot of time working on spreadsheets. You know, she's this brilliant person. And I think she recognizes like, yeah, all that fancy statistics is great. But if people can't get their data in, like,
00:15:33
Speaker
Why are we doing it? I remember talking to her about spreadsheets and she and I had this sort of shared miserable experience of having people... So in biology, most genes can... There's different ways to reference genes, but for humans, there's this thing called the HUGO ID, which is usually a couple of letters, like interleukin 18 is IL 18. And I had this experience where people
00:15:56
Speaker
external collaborators would give data in an Excel file. And then we tried to read it and we would eventually convert to CSV for the sessions we had. And then eventually it goes back into Excel or have these back and forths. And then Excel would take a HUGO ID like SEP-12 and think, oh, that's September 12th.
00:16:17
Speaker
Oh, yeah, sure. So it converts that cell to a date. And then when you save it as a CSV file, it converts that to an integer counted from some reference date. Right. And so like, you know, you're looking at this spreadsheet that's got maybe like 20,000 things in it, you're like, why is this long number there? And then later, when you figure this out, you're like, Holy shit, that was a date, right? You know, having tools that don't do that. I mean, it sounds silly. But like, that's what I mean: you end up fighting the process so much if you don't have good tooling.
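The date mix-up described in this part of the conversation can be sketched in a few lines. This is an illustration written in Python rather than R, and it is a simplified stand-in for what a spreadsheet does, not Excel's actual parsing logic. The function name `coerce_like_spreadsheet`, the sample identifiers, and the choice of 2022 as the assumed year are all hypothetical; the only outside fact relied on is that Excel's default date system stores a date as a day count that lines up with an epoch of 30 December 1899 for modern dates.

```python
from datetime import date

# Day zero that reproduces Excel's 1900-system serial numbers for modern dates.
EXCEL_EPOCH = date(1899, 12, 30)

MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}


def coerce_like_spreadsheet(cell, year):
    """Hypothetical helper: return a date serial if `cell` looks like MON-DD,
    otherwise return the text unchanged."""
    prefix, dash, suffix = cell.partition("-")
    month = MONTHS.get(prefix.upper())
    if not (dash and month and suffix.isdigit()):
        return cell
    try:
        guessed = date(year, month, int(suffix))
    except ValueError:  # e.g. FEB-31 is not a real day, so leave it as text
        return cell
    return (guessed - EXCEL_EPOCH).days


gene_ids = ["IL-18", "SEP-12", "TP53"]
# The year is passed in explicitly so the result does not depend on today's date.
print([coerce_like_spreadsheet(g, 2022) for g in gene_ids])
# ['IL-18', 44816, 'TP53']
```

Only the identifier that happens to look like a month and a day is silently replaced by a "long number", which is why the problem is so easy to miss in a sheet with thousands of rows.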
00:16:46
Speaker
Yeah. And, and, and R Markdown, Shiny, and now Quarto, it's so fundamental. Like, I'm kind of jealous of people now: they don't have to, they don't have to live with the pain that, you know, we had before. Yeah.
00:17:01
Speaker
you know, anything like GitHub, you know, I was using CVS in a really old version, you know, you know, version control and wow, you know? Yeah. Yeah. But I want to say like old man, but like, Oh, I know. But some of those, even some of those, like, the version control offerings that you'd have to purchase were just like, they were just not very good. Yeah. And just hard to use. Yeah. And then the notion of needing it for data analysis. Yeah.
00:17:29
Speaker
A statistician and I were meeting with some other statisticians and we were talking to them about using version control, and they were like, oh, we don't need that. And the person I was with, his name is Jim Rogers, said, and he said it in not a condescending way, but he basically said, you do, you just don't know it yet. You just haven't lost something that you wish you had. And then when you do, you know, remember, I'm like three doors down and I'll talk to you.
00:17:54
Speaker
evangelize the gospel of the version control. Okay, so let's pivot a bit. So you have a new book out with Julia Silge, who was on the show a few months ago. So I sort of noted it as like a review of the code syntax behind tidy models, but it's not really a review. It's more of like a step by step.
00:18:14
Speaker
Um, but it also is kind of a primer on regressions and modeling. And so I'm curious, like, let's start with it. So it seemed like there's these two pieces. And so did you set out to have these two pieces? Like not just going to give you a step by step in this particular coding language, um, which is kind of what like Hadley's R for Data Science book does. Um, but yours is more of like, I think you could sort of think about as like an intro stats textbook with doing it in R.

Purpose of the Book on Tidy Models

00:18:40
Speaker
Like, was that the, was that the idea?
00:18:42
Speaker
Uh, we thought it out like that. I think, I think that the problem we have, assuming it's a problem is that.
00:18:49
Speaker
When we want to teach any of this, whether it's in a workshop or in a book or whatever, it's very hard to be like, and here's how you do resampling. And people are like, well, what's resampling, right? You have to front load all this information about like, what's a training set? What's a test set? And we don't, we're usually writing for people who are not experts at this. Again, like, you know, we're not writing for ourselves. We're writing for people who work at a bank and their boss is like, hey, I read this thing. Go do a linear regression. They're like, what's linear?
00:19:19
Speaker
Right. And so, like, so, you know, people just don't know. And so a lot of times we want to talk about the syntax of what we're doing, even if it's something that's not particularly fancy, we do have to sort of talk about the nomenclature and the nomenclature leads a little bit into like, well, why would I do that? Like, why would I save some data as a test?
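One way to make the question of why data is saved as a test set concrete is a toy sketch like the one below. It is written in Python for illustration and does not use tidymodels or any R package; the helper names (`split_train_test`, `fit_memorizer`, `accuracy`) and the synthetic coin-flip data are assumptions made up for this example. The "model" simply memorizes its training rows, so it looks perfect on data it has seen and only the held-out rows reveal that it learned nothing.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of `rows` and hold out a fraction as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_memorizer(train):
    """A deliberately overfit 'model': memorize every training label and
    fall back to the most common training label for anything unseen."""
    lookup = dict(train)
    labels = list(lookup.values())
    fallback = max(sorted(set(labels)), key=labels.count)
    return lambda x: lookup.get(x, fallback)


def accuracy(model, rows):
    """Fraction of rows whose predicted label matches the recorded label."""
    return sum(model(x) == y for x, y in rows) / len(rows)


# Synthetic data: the label is a coin flip, so there is nothing real to learn.
rng = random.Random(1)
data = [(i, rng.choice(["a", "b"])) for i in range(200)]

train, test = split_train_test(data)
model = fit_memorizer(train)
train_acc = accuracy(model, train)
test_acc = accuracy(model, test)
print(f"training accuracy: {train_acc:.2f}")
print(f"test accuracy:     {test_acc:.2f}")
```

The training accuracy is perfect by construction, while the test accuracy should land near chance, which is the gap a held-out set exists to expose.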
00:19:37
Speaker
And so it's really, I don't think we'd ever be successful in teaching anything if we were like, and here's how you do a random forest without, oh, you want a random forest? To some extent, tell me what a random forest is. So I don't think there's any other way to do it. I mean, unless you're looking at like a pure like statistics or machine learning type of book where there's no syntax or application, I think you kind of have to do them hand in hand or otherwise it doesn't really,
00:20:05
Speaker
And the problem with me sometimes is just stopping myself from going on. What's the minimum amount that makes sense? And there's plenty of things I would like for them to know, but hey, this is a book on modeling software, so let's not worry about other things. Well, also the thing that's interesting is I think about your book.
00:20:26
Speaker
compared with like Hadley's book. Hadley's book starts with data visualization, right? And I don't know why he did that. My guess is that because with data visualization, when you start with ggplot, you get something right, you can see your success right away, right? You don't have to, you know, cleaning is kind of boring, you know, it's not like, and you get this success right away. And so, I mean, how did you and Julia sort of think about
00:20:56
Speaker
Did you think about that? You're like, just someone who's going to read this book is someone who's going to be interested in modeling and learning this. And, you know, we have to sort of think about that much as that much of a setup.
00:21:08
Speaker
Yeah, I don't know, I feel like if they were reading that book, they have a reason, right? They're like, all right, I've heard of this, or maybe I haven't heard of it, or haven't used it, but I know of its existence, what does it do? We don't really, a lot of times in training materials and some books, we do this like what's called the whole game, where we give them a little introductory chapter that's not very in depth, but it gives them a roadmap of like, hey, here's an analysis and what's happened.
00:21:34
Speaker
And then we'll go through this in detail later. We didn't really do that with this book. We kind of approached it from the standpoint of the first parts being about our philosophy. Like, why would I care about this? Like what, you know, sell me on why this is important or I should spend time with this. And then we went a little bit into like, well, why aren't you using base R? And then from there it kind of proceeds in terms of like how you would do your analysis. So we talk about like, you know,
00:22:00
Speaker
a little bit about exploratory data analysis. We talked about data splitting because this is the first thing you do. And then talk about like your first model and then, you know, eventually, you know, worm your way through to measuring performance and tuning models. And then there's a bunch of just assorted interesting stuff at the end. So if anything, it emulates sort of the, the process by which you would analyze your data, how you would model your data. So, um, yeah. And it's mostly.
00:22:27
Speaker
Like a book where we want to give them an early win or something like that is more a book where we want to sell people on using it, using that method in general. For Hadley's book, it might be somebody who's coming out of Excel and they're like,
00:22:41
Speaker
to give them something like that, like you said, is like you get them kind of hooked in. Shiny does the same thing. And with us, we sort of had the premise that, all right, well, you're picking up this book for a reason. You're going to do some modeling. Maybe you've heard of the tidyverse and tidy models. So now you're in, what do you do? Absolutely.
00:23:00
Speaker
So the case you gave is, you know, the bank analyst is asked by his boss or her boss to do something.

Skepticism Towards AutoML

00:23:06
Speaker
I'm curious what you think about, this is going totally different direction, but I'm curious what you think about making modeling too easy for people, right? That, you know, this has come up so much. Oh, I thought about this so much. Um, recently it's been a point of discussion.
00:23:27
Speaker
I mean, so, uh, this is like heresy to say, and people are going to be like, all right, delete now. But like, I am really, really not a believer in like any automated machine learning. I feel like that is like a recipe for disaster. It's not like Skynet like taking over, but you know, where things go wrong is, and that's been like my mantra, my entire career is like, what's the worst that can happen? And so, you know, most of the machine learning that I've done is like assistive.
00:23:52
Speaker
Like, you know, I don't need to tell a chemist, it's obvious this compound is the one you should synthesize to cure cancer, right? But giving them, I hate to use the word insights, but giving them prototypes like, hey, based on the data we have, here's like a de novo structure we think you might look at. And with the understanding that it might prompt them to say like, oh, I never thought about designing it that way or, you know, where they might take a structure that they have and say, well, how would this work if we made it? We haven't made it yet. But I'll give you the formula, you tell me how active it's going to be.
00:24:22
Speaker
And so that's sort of where I see the bulk of the utility and like machine learning and modeling and things like that. If we're talking about predictive models, as opposed to like making inferences and drawing conclusions. So, you know, that's sort of where I live. AutoML is something, depending on how you define that, that I think is really interesting and helpful.
00:24:43
Speaker
because I think as long as it instructs you where you learn something out of it. So to me, it's really scary to be like, oh, I just gave it my CSV file and I have this published model without knowing anything. So, you know, I feel like
00:25:02
Speaker
Yeah, I feel like we don't want to do that. In some sense, we have, I want to say put constraints on that in tidy models. We've enabled you to do a whole lot of stuff. In fact, a couple of the podcasts have talked about some of these tools that we've developed. They're like, oh no. But you know, just saying that like, you know, I had a lot of trepidation in making them. But on the other hand, like when I was doing modeling for a living, I would have wanted those tools.
00:25:25
Speaker
So we certainly don't have anything where you just blindly get results without any oversight. And also we put in a lot of guardrails to really prevent people from, there's a lot of pitfalls in machine learning. And I've done this, it's really easy to get a model you think has really high accuracy. And then six months later you get new samples and you're like, why am I missing them all?
00:25:48
Speaker
It's because you made a methodology error and didn't realize it. And so we know where most of those things are, and we've designed the software and the syntax to really, you'd have to go out of your way to
00:26:01
Speaker
do it poorly and not realize it. Yeah. So, you know, those are all things that, um, you know, that come to mind when people ask me that question. Um, you know, our friend David Robinson was talking about this in a blog post about tidy models. And, you know, and his point was, yeah, don't let people fool you with that. You should be concentrating on your problem. You should not be concentrating on is my software doing the wrong thing in my
00:26:25
Speaker
unassumingly like making some big error. You should use tools that you feel safe with and give you results. And he definitely felt like, you know, that was something that tidy models does. And I, and I agree with him. Um, you know, you don't want to get bogged down in.
00:26:40
Speaker
you know, am I resampling the right way or am I doing it properly? Did I use the training set in the right way or the test set? And again, like when I was talking about like tooling earlier that it goes into that, that like if your job is modeling and you have good tools and I think we're making those, then you get to be so much, much more productive because you're not bogged down in the, you know, the errors or like, oh, but did I do that right? And so yeah, I feel like that's probably the best answer is
00:27:08
Speaker
You know, I'll admit like I was talking to somebody who I knew pretty well and I was at their institution and we were talking and he was using caret at the time. And he said, you know, I try like a dozen models in caret and I don't really know what a lot of them do or how they work, but I get one that seems like it's doing really well. And then I go read the paper.
00:27:28
Speaker
You know, be like, all right, so what exactly is it doing? How can I figure that out? And to me, that made me feel, the first part of the conversation was like, okay. And then the second part, I was like, yeah, okay, that's, I'm happy. So yeah, you know, we want to facilitate, we especially want to facilitate people doing things that they couldn't do before, whether there wasn't an R package to do it or, and this happens to some extent, there's something that you want to use in R,
00:27:54
Speaker
but you end up slamming your keyboard on the table because it's so awful to use. And so we wanna like smooth all that out. We wanna just either rewrite those things or make them easier to use and more consistent and work well. And it allows you to do things that you really would have had trouble doing before, so it adds a lot of power to what you're doing. But again, you can do it in a way that you hopefully can feel safe in your application at least. Yeah, it's like building
00:28:24
Speaker
some roadblock, but not a wall, right? It's like a speed bump, right? You want people to slow down, but you do want them to be able to get to the end of the street. Yeah. Maybe the analogy I'd use is like going off road.
00:28:37
Speaker
Okay. Honestly, I can go straight to my grocery store from my house, go through a lot of people's lawns and get there much faster. But the idea is like, but I have to do a lot of work to get there, right? Yeah. Here's another example from caret. Like early on somebody, and God bless them for doing this, but they emailed me and said, oh, I wrote this blog post about caret and how it helped me win this Kaggle competition. I just want to say thanks. I was like, wow, that's really nice.
00:29:03
Speaker
Yeah, I got around to reading it. It was like, well, I mean, they weren't wrong, but they basically used caret to try to map their training set to the test set and make them as similar as possible to maximize their accuracy. I'm like, okay. Yeah. Yeah. I mean, we did it. Yeah. We just want to make sure we don't put roadblocks up anywhere, but like to do that, you would have to really like,
00:29:26
Speaker
you'd have to do some interesting things with tidymodels to be able to like finagle that. Yeah. So it would be so like unnatural for you to do, it's like, why am I doing this? Yeah. I mean, I mean, it's interesting because there, there are lots of like design tools now, like Canva and Figma and these other tools that are sort of democratizing design. And I'm sure there's lots of graphic designers out there who hate those tools because you're putting these fairly powerful design tools in the hands of people who
00:29:52
Speaker
don't know design. And I guess there's an equivalence with statistical packages where you're giving power to people who, like you said, they're running regressions, but they don't really know what they're doing, or they don't know how to interpret it, or they're, you know, mucking with the data in ways that maybe they shouldn't.
00:30:09
Speaker
And that is the modern history of statistics. So I'm serious. Like when I was, when I was in my twenties, I was reading papers about the democratization of statistics and people having Minitab or SPSS or whatever. And, you know, and I think as a profession, statisticians have had to come to terms with the idea that you can't be the data cop. Right.
00:30:32
Speaker
You know, it's, it's not that it's in, you know, I think there's a history in statistics of being sort of like, the analogy I use is like, like, you know, like a PhD in statistics is like some wizard that lives in a cave. And, you know, people are scared, they don't know how to kill the dragon. So they go and they like,
00:30:51
Speaker
you know, genuflect in front of the wizard, and the wizard's like, oh, I'll teach you the incantations to do this complicated thing and solve your problems. And, you know, the wizard's not out there helping them grow wheat, right? And so, and I feel like statistics over the years has suffered because of this sort of inherent, I mean, it's a generalization, but I think the job, for a lot of the statisticians I've worked with, is like, yes, I'll come and bless what you're doing. And if you didn't do it right, I won't scold you, but I'll look disapprovingly.
00:31:18
Speaker
And so, you know, I feel like data science, as much as it was, you know, talked negatively about the whole Six Sigma thing, my experience with it and the corporation I worked with was actually very positive because there was a statistician almost every project.
00:31:35
Speaker
Yeah. And so, you know, we got in the mud a little bit and like, oh, that's, these are the problems that they're facing, not like, should I use a t-test or a Wilcoxon test? So I feel like, you know, the, the environment, we've had to accept the idea that we're not
00:31:51
Speaker
In fact, we've been marginalized to some degree because there's so many tools. And honestly, I think that on the computer science side, in some ways are a lot better at promoting and talking about things. And there's so many, many more of them that I feel like I've been in places where it's like, well, yeah, you're doing a bunch of statistics. Why aren't you in the statistics group?
00:32:12
Speaker
We're not because like, that's how we got 20 people instead of five people because we're not. So it's really been a detriment. The cycle repeats itself like every 10 years. It happened with, in the eighties with Taguchi methods and, and, uh, spectrometry, like partial least squares, and it's happening in machine learning with boosting, where,
00:32:31
Speaker
You know, we're getting a little bit as a, I feel like as a population, we get a little bit complacent. We see things from outside our community being published that have really good ideas, but maybe aren't really statistically all that rigorous. And then, you know, it's more like once you have ideas, we can refine them. Right. And I feel like we're on the road to being more integrated.
00:32:52
Speaker
and more hands-on and more just generally proactive than we were before. Because we don't have much choice. In a way, that's a good thing. Yeah. The last thing I wanted to ask you, especially for folks who are listening to this who are maybe less interested in the modeling side and more interested in the data vis side, can you talk a little bit about the link between tidy models and I guess ggplot would be the way most people would go?

Importance of Data Visualization with ggplot

00:33:18
Speaker
So I say this quote all the time, if you've heard me speak before, you've most likely heard it, but a professor I had said, like, you know, the only way to be comfortable with your data is to never look at it. And so before doing any modeling, you know, we promote, you know, in our workshops and things like that: hey, let's take 10 minutes and, whether it's base R or just
00:33:39
Speaker
summary statistics or ggplot, look at your data. What do you notice about this? That really informs the models. So, you know, there's usually this big feedback loop of, like, you build a model, it works okay, you can figure out where it doesn't work, and then you have to do a bunch of, like, exploratory analysis to figure out why. Why are those, right? Why are those examples doing poorly?
00:34:02
Speaker
Also, recipes in particular have some nice tools where you can do, especially for high dimensional data, you can do very helpful and informative feature reductions. There's always principal component analysis, but there's a whole host of things like that, some of which are nonlinear, some of which are supervised, that will really help you understand your data a lot better. Back when I was doing computational biology, the minimum number of outcomes I had in my experiments were about 7,000.
00:34:30
Speaker
You know, ranging up to about, you know, like a million, right? Yeah. And so now you have all these dimensions and it's a very dense dataset. Yeah. And how do I know if, you know, but you've got like 30 samples, you know, how do I know if any one of those samples is problematic? And so. Yeah.
00:34:47
Speaker
using these particular techniques, it gets you very far very quickly to figure out like, oh, um, yeah, this one, this one's all goofed up because, you know, they, they ran it a week or two after the others or, you know, whatever the example might be.
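To make the "30 samples, thousands of dimensions" screening idea concrete, here is a small Python sketch. It is not the recipes API; it assumes simulated data (30 samples, 200 features) with one sample deliberately shifted, and it uses a hand-written power iteration rather than a library PCA. The function name `first_pc_scores` is hypothetical.

```python
import math
import random


def first_pc_scores(rows, iterations=200, seed=0):
    """Return each sample's score on the first principal component.

    Works on the samples-by-samples Gram matrix, which stays small when
    there are few samples and many features.
    """
    n, p = len(rows), len(rows[0])
    # Center every feature (column) at zero.
    col_means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - col_means[j] for j in range(p)] for r in rows]
    # Gram matrix G = X X^T, of size n x n.
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in centered]
            for ri in centered]
    # Power iteration for the leading eigenvector of G.
    rng = random.Random(seed)
    vec = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(iterations):
        nxt = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the leading eigenvalue (vec has unit length).
    eigenvalue = sum(v * sum(g * w for g, w in zip(row, vec))
                     for row, v in zip(gram, vec))
    # PC1 scores are the eigenvector scaled by the singular value.
    return [v * math.sqrt(eigenvalue) for v in vec]


# Simulated experiment: 30 samples, 200 features (assumed sizes).
rng = random.Random(1)
n_samples, n_features = 30, 200
data = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]

# One sample is shifted across all features, e.g. it was run later.
odd_sample = 7
data[odd_sample] = [x + 3.0 for x in data[odd_sample]]

scores = first_pc_scores(data)
flagged = max(range(n_samples), key=lambda i: abs(scores[i]))
print("Sample with the most extreme PC1 score:", flagged)
```

In practice you would plot the first couple of component scores and look for points sitting away from the rest, instead of picking a single maximum; the sketch only shows why a low-dimensional projection can surface a bad sample quickly.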
00:35:03
Speaker
If we do anything analytical, like in terms of analysis, it's kind of a recipe for disaster if you're not looking at your data. ggplot facilitates that to look at your data, and then what you learn can then facilitate going backwards. You might, in some data analysis, figure out two of these model terms are really interacting with each other.
00:35:26
Speaker
And that's maybe the ggplot that you show to your boss's boss that says like, oh, look, we can exploit this interaction and make more money or make a better widget or whatever it is that you're doing.
00:35:38
Speaker
You really can't divorce data analysis and visualization. They really are tied together. Yeah. Okay. So terrific.

Availability of the Book and Additional Resources

00:35:50
Speaker
Thanks so much. So Tidy Models, people can buy the actual physical book, but there's also an open source version that they can check out. And I'll put the link to that. And there's code snippets in there too, right? So they could, they shouldn't, but they could just copy and paste and get things to run.
00:36:03
Speaker
And the whole book is there, like all the source files that if you wanted to compile the book and print it on your printer, like, I mean, have fun with that.
00:36:12
Speaker
Yeah. It's all out there for you. And there's also tidymodels.org, a nice website we put together that has a lot of tutorials and a lot of really good resources. So if you like long form, check out the book. If you want more short form, almost like blog-post-length information, that's a better way to go. All right. Well, I'll put links to that and everything we talked about. I've got a whole, I'll have like the history of statistical packages on this, on the show notes for today. So terrific. Max, thanks so much for coming on the show. This is great.
00:36:59
Speaker
Thank you. Thank you for having me.
00:37:09
Speaker
The whole team helps bring you the PolicyViz Podcast. Intro and outro music is provided by the NRIs, a band based here in Northern Virginia. Audio editing is provided by Ken Skaggs. Design and promotion is created with assistance from Sharon Satsuki-Ramirez, and each episode is transcribed by Jenny Transcription Services. If you'd like to help support the podcast, please share and review it on iTunes, Stitcher, Spotify, YouTube, or wherever you get your podcasts.
00:37:32
Speaker
The PolicyViz Podcast is ad-free and supported by listeners, but if you would like to help support the show financially, please visit our Winnow app, PayPal page, or Patreon page, all linked and available at policyviz.com.