Introduction and Guest Introduction
00:00:11
Speaker
Hi, everyone. Welcome back to the PolicyViz Podcast. I'm your host, Jon Schwabish. And on this week's episode, we are going to talk about R. And we're going to talk about how to learn R. And to help me do so, I sat down with Michael Freeman, who is a lecturer at the University of Washington Information School out in Washington state. He and Joel Ross have just published a new book called Programming Skills for Data Science.
00:00:35
Speaker
Start Writing Code to Wrangle, Analyze, and Visualize Data with R. So Mike and I sat down and talked about how he uses R. We also talked about other parts of the data science ecosystem like GitHub, which is a pretty sizable fraction of the book. I think that's a great addition to the general books out to date that talk about how to use R and how to code in R, which is obviously a subject in and of itself, but also how to share code and how to use some of these other
Podcast Support and Upcoming Episodes
00:01:02
Speaker
tools and platforms that are out there. We also spent a little bit of time talking about how Mike teaches R in his classes at the University of Washington. So it's a really interesting discussion. And of course, before I get over to the interview, just another reminder: if you'd like to help support the show, please consider becoming a Patreon supporter or reviewing the show on iTunes or your favorite podcast provider. All those are really
00:01:25
Speaker
appreciated, to help others learn about the show and listen to the show, and of course to help with the financial needs I have for the show: the audio editing, the transcription services, and all the other good things that I need to help bring the show to you every other week.
00:01:40
Speaker
So there's a few new episodes I've got coming up in the next few weeks. I'm really excited about some of the guests I have. I'm branching out a little bit into areas that are associated with data visualization and communicating. So I'm really excited to bring you some really fun guests that I've lined up. So here
Michael Freeman's Background and Transition to Data Science
00:01:57
Speaker
we go. This is my interview with Michael Freeman, senior lecturer at the University of Washington Information School.
00:02:06
Speaker
All right, so I'm here with Mike Freeman, who is a senior lecturer at the Information School at the University of Washington, who has a new book out, Programming Skills for Data Science: Start Writing Code to Wrangle, Analyze, and Visualize Data with R. Mike, how are you? Welcome to the show. I'm doing great. Thanks so much for having me, Jon. It's been a while since we chatted last, I think. Was it at OpenViz? It was at OpenViz. Yeah. Many months have passed. And since that time, you have a whole new book out.
00:02:34
Speaker
I do. I do. I thought I was nearly done with it last time we spoke and now I'm surely truly done with it. I can hold it in my hand. Before we get into the book and talk about the book, maybe you can talk a little bit about yourself and your background and maybe also your beginnings with R to sort of set the stage for folks so we can talk about the book. That sounds great.
00:02:58
Speaker
I am a senior lecturer, as you said, at the information school. If you're not familiar with one, it is where one would learn the skills for working with information. Historically, that was a librarian degree, a master's of library and information science, and the modern version of that or the
00:03:15
Speaker
incarnation of that is an informatics degree targeted at undergraduates, where they learn the skills for working with information. That incorporates data science, user experience design, and a variety of other courses that deal with how people, information, and technology intersect.
00:03:32
Speaker
There I teach courses in programming, data science, web development, and data visualization, and have a lot of fun with all of those. And the way that I've come to this is my background is actually in public health. I have a master's degree in public health.
00:03:48
Speaker
And just about everything I learned, I learned when I was working at a global health research center called the Institute for Health Metrics and Evaluation, which is part of the University of Washington.
Challenges and Approaches in Learning R
00:04:00
Speaker
I hadn't written a single line of code before I got there. It was, you know, I studied sociology as an undergraduate and was shocked to learn that studying global health quantitatively wasn't just sitting around and thinking about it, but involved writing a lot of code.
00:04:14
Speaker
And most of that while I was there, I worked a little bit with Stata or Stata. I'm going to have that separate debate later. And some of that was in Python and some of that was in R. And I had a really challenging time figuring out the different environments and how different things fit together. And just getting started writing code to work with data was something that took me a long time. I didn't come by it so naturally. I didn't start writing code when I was 13.
00:04:43
Speaker
And a lot of what is in this book is the set of things I wish I knew when I started working in this research and analyst or quantitative analyst role. So before we go on, I want to make sure that folks know that you coauthored this book with Joel Ross. Yes. So let's make sure we give him props. So he's also at the Information School. Yes. So Joel Ross is also a senior lecturer at the Information School. He couldn't be with us on the show today, but he is a fabulous instructor and writer,
00:05:12
Speaker
a really great educator and it was truly a delight to work with him. Coincidentally, we went to the same undergraduate college, Colorado College, though we weren't there at the same time. We really enjoyed working together on this book and I usually point to some of the better sections as the ones that he put together. So it sounds like, from when we've talked in the past, that this book started as a collection of online tutorials that you two had written, I would presume for your students. Is that right?
Role of Version Control in Data Science
00:05:42
Speaker
Yeah, so the origin of this was actually for a course that I put together called the Technical Foundations of Informatics, which I often described more publicly as an introduction to an introduction to data science, which was not a book title that anybody liked, but I thought it was funny.
00:05:59
Speaker
But we wanted to ensure that students entering our program had a shared foundation so that they understood the foundations of data literacy and visualization, that they knew how to use different tools like Git and GitHub in their terminal, and also that they knew sort of these overlooked skills like how to use Markdown or how to collaborate on GitHub. So we wanted to have that foundation. Now we put about a thousand students a year through that course.
00:06:23
Speaker
But we also wanted to bring this to people that weren't going to study informatics independently, but might do something like minor in informatics and study political science or study sociology. And that has been some of the most interesting cases. You know, now I have students that come to me and say, hey, I'm an international affairs student and I just made a FOIA request for all the deportations by ICE. And how do I work with this data? I think that is an excellent candidate for these skills to programmatically interact with it and generate
00:06:51
Speaker
some really compelling resources using that information. Yeah, yeah, absolutely. So let's talk about the book. So where does this book, in your view, fit in with the existing library of R books that are out there? I mean, there's a ton of them, we're not going to list them. But where does this fit? What sort of gaps in the literature did you guys identify, saying, you know, there are certain holes this book can fill?
00:07:17
Speaker
That's a great question because there are a lot of great R resources and we sure wouldn't have gone through all of the painstaking process of writing a book if we didn't think it was worthwhile. So I think there are a few things that make this distinct. It's an introductory book.
00:07:32
Speaker
So people don't need any background to use it. It also has a combination of skills. Like we really believe that if you're going to start writing code to work with data, you have to know how to keep track
Teaching Strategies and Book's Approach to R
00:07:42
Speaker
of that code. And you have to be familiar with the tools for navigating your computer. And you need to know how to collaborate writing code. And I don't just mean emailing a file back and forth with different, you know, underscore final, underscore final, final, underscore new final at the end of the tag. Right. For final v2. Right. Right. Or new final. And is that the newer final or whatever?
00:08:02
Speaker
Yeah, and I think we also dug into a lot of the pain points, right? I'm teaching 150 students right now using this book, and the biggest pain points are installing stuff on your computer and handling the different types of errors that arise. So we wanted to cover those skills.
00:08:17
Speaker
We also wanted to have a little bit of the conceptual work in there. So we have a chapter on understanding data, you know, different types of data structures, different types of ways that data gets generated through surveys and through sensors. We have a similar chapter on designing data and visualizations, which isn't about writing the code, though that's the focus of most of the other chapters.
00:08:39
Speaker
but brings in some of those best practices and principles so that when you are writing the code, you're generating something that is influential and clear and effective. It also doesn't go into a lot of the statistical side. So we didn't want to focus this, say, on statisticians or machine learning. We think of these as the skills that underlie those different fields because, as we know, 90% of doing data science is data wrangling and the other 10% is complaining about it.
00:09:07
Speaker
Right. I do want to get into a couple of things about the book that I find that are particularly interesting, the GitHub section, the R Markdown section that I think aren't really in a lot of the other books. But before we do that, you were talking about teaching your 150 or so odd students. And I've been talking with people a lot lately about learning code. I basically learned R like two months ago after putting it off for a long time. There's a great deal for you.
00:09:34
Speaker
There's a guy here, Aaron Williams, who's Urban's R leader, basically. And I said, Aaron, these MOOCs aren't working for me. I don't have the, you know, the six weeks to sit down and do them. So, so he sat down with me for two full days and we just did this R learning sprint. And that really worked for me. So I'm building up to this question of when you are teaching your students R, what are some of the strategies you use so that they can overcome some of these challenges? Like you said, installing is a big hurdle for people.
00:10:04
Speaker
I guess you can write here the steps, but does that work for everybody? I found it really helpful to sit down with him and say, okay, I don't know how to do this thing. Can you help me figure it out? Having someone next to you is really helpful. Absolutely.
00:10:18
Speaker
The book doesn't do everything. We have a bunch of exercises and their solutions online, and those are publicly available for anyone. So those are up on GitHub. Anyone can look through those. Some students really like to read the solution to something. Others like to work through it.
00:10:34
Speaker
But it comes down to having a variety of different techniques. In a two-hour class session, for maybe half of that time students are actively writing code, and a group of TAs and I are walking around, sitting next to students, troubleshooting, and also encouraging them to work with one another. If you have a question, particularly if you're packed into the middle of a row, ask the person next to you. We also do a lot of active learning assessment. So I might explain something like how to access a particular row of a data frame.
00:11:04
Speaker
And then I'll use a poll, like an in-person poll, and put up five lines of code and say which one of these lines of code will get the information that you're looking for. So I actually get an assessment during the class time about whether or not students have absorbed the material yet. Right. Right. So what's the ratio of TAs to students? It's about 1 to 25. OK. OK.
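As a rough illustration of the kind of question described above, accessing a particular row of a data frame in base R might look like the sketch below. The data frame, its column names, and its values are made up for this example; they are not taken from the class or the book.

```r
# Hypothetical data frame used only for illustration.
students <- data.frame(
  name = c("Ada", "Grace", "Hadley"),
  score = c(91, 85, 78),
  stringsAsFactors = FALSE
)

# Access a row by position: row 2, all columns.
second_row <- students[2, ]

# Access a row by condition: rows where `name` equals "Grace".
grace_row <- students[students$name == "Grace", ]

# Access a single value: the `score` column for that same row.
grace_score <- students[students$name == "Grace", "score"]
```

The comma inside the square brackets is the part a poll like this tends to probe: what comes before it selects rows, and what comes after it selects columns.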
00:11:26
Speaker
But yeah, that collaborative learning, where you have someone sitting next to you who might be able to solve your problem, I think is a good point. But okay, back to the book specifically. So you start the book with setting up the computer, like you said, and installing. And then the second part of the book,
00:11:42
Speaker
I'm looking at it now, like 20 pages in, you get to managing projects. And I think that's a really important part of the data flow, the data workflow that is missing from a lot of books. So can you talk maybe a little bit about that section of the book and also what your experience is teaching that content? Because probably for a lot of people, the lingo, the jargon for GitHub is really
00:12:04
Speaker
difficult to get. And so what has your experience been teaching those couple of chapters?
Integrating GitHub and R for Collaboration
00:12:09
Speaker
Yeah, so we put the managing projects and version control right at the front of the book, largely because years ago, I did a survey of my students and I said, you know, could you describe your confidence with programming, you know, no experience, moderate, lots, whatever.
00:12:25
Speaker
And can you also explain your experience with keeping track of your code and version control? And I think one year, 82% of the students in an introductory course said, I'm a moderate or confident programmer. And of those 82%, I think it was maybe 12% had used GitHub or any other version control system.
00:12:44
Speaker
So that's akin to saying that you're a good driver, but you don't know how your seatbelt works. You don't know how these safety mechanisms are in place. So it is the first thing that we talk about. We actually talk about how to keep track of projects before we talk about the R programming language. And it's a little tedious, and it's a little abstract.
00:13:01
Speaker
But GitHub as a tool is something that makes it a little bit more satisfying. You can actually write markdown code and then push it up to GitHub and have it be hosted as a website if you configure your branches properly. So that's something that I think students get excited about. It's easy to see what that progress is.
00:13:19
Speaker
Certainly the more complicated aspect of that is working in teams. So before this class existed, in upper-level courses, say web development, we would just say, you know, you're going to work in a group for this project, without spending the one or two class sessions saying, you know, what is a merge conflict? Or how do you have multiple people working on the same project?
00:13:40
Speaker
So it's something that is one of the more important things about actually working together on a data science project, but is overlooked. And for that reason, we teach it early. We have them adding, committing, and pushing every day as part of demonstrating that they've worked on some exercises. And then later on, we have specific exercises and assignments where they work in groups together. We force them to create merge conflicts so that they aren't scary when they happen.
00:14:07
Speaker
Right. Right. Really interesting. So the other question I had for you was, I'm always curious how, obviously it's a tool book for data and database, but I'm always curious about how people pick the functions that they're going to highlight. So the book has the tidyverse, you know, framework, but I'm curious about when you're going through and you're writing this book, like how do you decide which commands and functions you're going to include and which ones you're going to exclude?
00:14:32
Speaker
That's a great question. How do we decide what makes it in? Because even though it is nearly 400 pages, there's a ton that isn't covered. I think it came down to a couple things. One was just, what do we use a lot? If you sit down and you grab a dataset and you want to do some analysis, make a report about it, make a variety of charts,
00:14:54
Speaker
What are the skills that you need to create that product at the end? And that often involves reshaping your data, which is why there's a tidyr chapter. It isn't just because I think that gathering and spreading are like cool functions. They were ones that I used and they're ones that are kind of conceptually difficult, right?
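To make the reshaping idea concrete, here is a minimal sketch of gathering and spreading with the tidyr package. The data frame and its values are invented for illustration and are not from the book. Note that `gather()` and `spread()` still exist in tidyr but have since been superseded by `pivot_longer()` and `pivot_wider()`.

```r
library(tidyr)

# Hypothetical "wide" data: one row per state, one column per year.
wide <- data.frame(
  state = c("WA", "OR"),
  y2016 = c(1, 2),
  y2017 = c(3, 4),
  stringsAsFactors = FALSE
)

# Gather the year columns into key/value pairs: one row per state and year.
long <- gather(wide, key = "year", value = "value", y2016, y2017)

# Spread the key/value pairs back out into one column per year.
back <- spread(long, key = year, value = value)
```

Here `long` has one row for each state and year combination, which is often the shape that plotting and summarizing functions expect.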
00:15:10
Speaker
I think the idea of the book is it sets people up to then go further. Not everything is in it. And if software evolves the way that it should, there will be another set of functions and packages that are important to be using in a few years that make things easier.
00:15:25
Speaker
But the ones that we selected are really commonly used. We would look around on blogs and Twitter and whatever else to see what people were using to solve problems. But the purpose of doing data science isn't to write code. It's to figure out answers to your questions. And these are the tools that we found most often helped us answer questions about really important things.
Advanced R Topics and Tools for Data Science
00:15:44
Speaker
And we have these "in action" sections, maybe at the back of six of the chapters, that take a real data set and walk through the
00:15:52
Speaker
whatever the cleaning challenge is and do it to try and surface really important patterns about anything from police violence to evictions in San Francisco to things of course that are more light-hearted like finding good restaurants in Seattle. So there's a balance there but we were selecting these things based on what we thought people needed to know to get the information that they were hoping to produce in the first place.
00:16:15
Speaker
Yeah. I mean, what I find really fascinating about the book is that unlike a lot of the other R books, and like you said, you're never going to be able to capture everything. So this isn't a knock on any of those other books either. But what you and Joel seem to have done is take a lot of the core things of R and branch out just slightly to do things about sharing. And also, I'm looking at the chapter on accessing web APIs, which doesn't seem to be a big part of a lot of other R books,
00:16:44
Speaker
but seems to be one of the big advantages of tools like R and Python that seems to be part of the new data science, new programming language toolkit.
00:16:54
Speaker
Yeah, I think so often in, say, university environments, people are just handed a data set. And we found that a lot of times people were trying to use more robust or complex data sources. So we do have a chapter on interacting with databases. We do have a chapter that introduces web APIs and how to use them, and covers, of course, both the conceptual of what is this thing and how does it work, and also how do you then interact with it from the R programming language.
00:17:21
Speaker
Right. So towards the end, you have the last section, the sixth section, on building and sharing applications, where you talk about R Markdown, Shiny, and some other collaborative tools. Now you can see I haven't really read the book cover to cover yet. I've just been tagging the things that I need. But what is the difference between R Markdown and Markdown?
00:17:41
Speaker
That's great. Great question. So you're now quizzing me to see if I know the material in the book, or if I can explain that well. I could just ramble here for a couple seconds. Come on, Jon, you should have gone to page 400. So R Markdown is awesome because it combines the easy-to-use syntax of Markdown with the ability to write R code in the middle of the document. So if you wanted to write a report on
00:18:10
Speaker
urbanization in Washington, DC, or what's going to happen when Amazon joins your city (separate conversation), you could both have the writing that you would do and the headers and the images, but as your calculations change, you could actually inject those figures or numbers or charts based on the data. So you could start with basically a .Rmd script,
00:18:34
Speaker
where you could write everything in Markdown. But anytime you want to reference a value, you could actually reference an R variable that you'd created. And you could have these little chunks of R code that produce your plots or maps. And then you can compile that into something like a website, like a .html file.
00:18:49
Speaker
Do you think that that is the future of publishing, especially technical books? I mean, we're seeing more and more books that are already written in Markdown or R Markdown. Do you think the future of, I guess, things coming together in the publishing world and the programming world and the open source world is having everything in a Markdown document before it actually gets into the hardcover paper version?
00:19:13
Speaker
That's a great question. So, as you had referenced, we made a set of tutorials. Those are still online and publicly available and free at info201.github.io so people can go read something that this book is based off of. Believe it or not, we wrote this entire book in R Markdown and then, instead of compiling it to a web document, compiled it to PDF, so that every time we had some sentence that said, you know, the number of times that the word data appears in this book is 482,
00:19:41
Speaker
That was actually a variable and it was counting the number of times in the text. So it was something great for our interactions. We also did this all using Git and GitHub and version control so we could both work on the book at the same time. And it's a great way to share information. I think a lot of what publishing a print book meant for us was it forced us to really polish it.
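A computed count like the one described above could be sketched in base R as follows. This is only an illustration under assumed inputs: the `book_text` vector is a made-up stand-in for the manuscript, and the authors' actual counting code is not shown in the conversation.

```r
# Hypothetical stand-in for the text of a manuscript.
book_text <- c(
  "Data science starts with data.",
  "Wrangling data is most of the work."
)

# Find every whole-word occurrence of "data", ignoring case.
matches <- gregexpr("\\bdata\\b", book_text, ignore.case = TRUE, perl = TRUE)

# Count the matches in each string, then add them up.
data_count <- sum(lengths(regmatches(book_text, matches)))

# In an R Markdown document, an inline expression that references
# data_count would print the current value inside a sentence, so the
# number updates whenever the text changes and the document is recompiled.
```

The point of storing the count in a variable is that the prose never has to be edited by hand when the underlying text changes.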
00:20:03
Speaker
and really come up with excellent examples. No one wants to read a chapter based on an example of car engine sizes in 1976. That's not interesting. So we would go back and we would revise these and we would look at each state in the United States. What was the proportion that they went for the Democrat or the Republican for each national election?
00:20:24
Speaker
And correspondingly, we would, instead of having like weird diagrams that we made using Google Slides, we would have these professionally cleaned up and think through them more carefully. And that was, you know, that's what makes this the next level up from this set of tutorials that we would have online. Right, right, right, right. So let me switch. I have two questions for you. We're going to do like, love and hate. So let's start with the hate. So what do you hate most about R?
00:20:52
Speaker
Uh, well, if I hated it, I'm curious, you know, uh, cause I've just, you know, I'm like, like two months into my learning curve on R. So I'm curious, you know, for someone who's been doing this for a while, like what's the thing that bugs you the most? That's a great question. Uh, I mean, I obviously like it enough to write a book about it, but you can't, you can't write a book about something and have universally positive emotions about it.
00:21:13
Speaker
As someone that has been using R for a number of years, honestly, the thing that I dislike most about it has nothing to do with the programming language and has everything to do with the question, why are you using the R language? It hasn't yet reached that level, particularly in the software engineering community of being recognized like, oh, you wrote a book about data science. Why didn't you write it in Python? It's like, I could have. We could rewrite this book entirely in Python.
00:21:42
Speaker
work with the same data and achieve the same things. But I think R is easier to get started with because of RStudio and some of the tools within that IDE. And it's a lot more popular in the social sciences, which is my background. So I think that is perhaps one of my biggest gripes about using R, though come to me next time I'm in the middle of a project, and I'll complain about whatever error I'm getting in donuts.
00:22:09
Speaker
Well, it's interesting. There was a little Twitter thread a couple of days ago, and the person will have to forgive me because I don't remember who it was, but it was an undergraduate student applying for PhD programs in economics. They had learned R and wanted to know what else they should learn before
Evolution of Data Science Tools and R's Role
00:22:25
Speaker
applying. And someone said, you should learn Stata because this is what all the
00:22:27
Speaker
you know, researchers who are, you know, current senior researchers, that's what they use. And so if you're going to work for a senior researcher as a research assistant, you should know Stata. And, you know, that's interesting because where I am, it's the same sort of thing, the senior researchers, I mean, my background is in SAS and Stata, but there's obviously an evolution towards R. And so it'll be interesting, especially over the next, you know, five to 10 years, to see how things change and how the programming tools that people use evolve.
00:22:55
Speaker
Yeah, and I think that focusing on free and open source tools is great. And also having that shared language maybe isn't necessary as you go into research or professional environments. I have a friend who's a professor who said, you know, in his undergrad, he got great at R, and then in his PhD program, he got great at Python. And now as a professor, he's great at PowerPoint. Like we don't necessarily need to have those same tools because you should be able to translate the types of things that you're trying to accomplish across different languages.
00:23:24
Speaker
Yeah, and I kind of feel like, you know, Stata, SAS, R, and SPSS, for example, like those seem all within, you know, all clustered relatively close together, you know, like JavaScript and Python sort of seem, you know, more object-oriented, further away from that core thing. And for me,
00:23:41
Speaker
being able to code in Stata, going to R was not a huge lift because they are similar, it's just a different syntax, but the philosophy is somewhat similar. And so I agree, just knowing how to code in general enables you to maybe flip back and forth between different languages, just learning different syntax. Totally. Okay, so you told me what you hate most about it, and you've already mentioned a few things that you like most about it, but is there a single thing that really is the thing you love most about R?
00:24:11
Speaker
When I'm using it, it is awesome. I really like that the tidyverse is great: to use dplyr and ggplot to really expressively talk about what I'm doing. Right. I'm like, oh, I'm actually selecting and filtering and arranging. Like these things are really, yeah, they make sense. They're intuitive. And that makes for a really great workflow, really easy to read code.
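The selecting, filtering, and arranging mentioned above map directly onto dplyr functions. A minimal sketch, using an invented `restaurants` data frame (the names, neighborhoods, and ratings are hypothetical, not from the book):

```r
library(dplyr)

# Hypothetical data used only for illustration.
restaurants <- data.frame(
  name = c("A", "B", "C", "D"),
  neighborhood = c("Ballard", "Fremont", "Ballard", "Capitol Hill"),
  rating = c(4.5, 3.0, 4.8, 4.1),
  stringsAsFactors = FALSE
)

top_rated <- restaurants %>%
  select(name, rating) %>%     # keep only the columns of interest
  filter(rating >= 4) %>%      # keep only the rows that meet a condition
  arrange(desc(rating))        # sort rows from highest to lowest rating
```

Read top to bottom, the pipeline says what it does in roughly the same words a person would use to describe the analysis, which is the readability being praised here.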
00:24:37
Speaker
The thing that I like about it the most in terms of teaching is the amount of progress that you can make in a quarter. It is amazing to teach in this environment where you can start by talking about the basics of storing values in a variable, right? This is something that students have perhaps never encountered before. And you end with this interactive web application, where students have built a live website that is connected to an R session in which they can have
00:25:01
Speaker
interactive maps and multi-page websites and all of that is generated using the R programming language. So that breadth of different types of work you can do has made it something that is both really great as a researcher or as a professional and someone working with data.
00:25:19
Speaker
but also as an instructor to say, hey, look, we're going to start in the first two weeks. You're going to be putting the things on the right side of this weird assignment operator into the thing on the left side. And we're going to be kind of particular about it. And in the end, you're going to take on data sets about where there are mass shootings in the United States. And you're going to create something that exposes these patterns that you find pertinent and important. You're going to talk about it.
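For readers who have not seen it, the assignment operator mentioned above is `<-`: the value on its right side is stored in the name on its left side. A minimal sketch with made-up variable names:

```r
# Store the value on the right in the variable named on the left.
number_of_students <- 150
hours_per_session <- 2

# The right side can be any expression; it is evaluated first,
# and the result is then stored in the variable on the left.
total_student_hours <- number_of_students * hours_per_session
```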
Conclusion and Encouragement to Explore the Book
00:25:42
Speaker
And we're going to move away from the specifics of the language to the importance of the thing you're working with. Yeah.
00:25:47
Speaker
So last question for you is how do you recommend people work through the book? Is it a
00:25:55
Speaker
Do they go cover to cover? Should they flip back and forth between the book and the website for the book? You do this obviously in your class. So what's your recommendation as people pick this up? What's the best way for them to work through the lessons in the book? That's an awesome question. So we have exercises associated with each chapter that are up on GitHub. I think it is github.com/programming-for-data-science.
00:26:21
Speaker
And I'm someone who learns a lot by doing. So what I would do is I would read a chapter and then I would work through whatever set of exercises I found appropriate. But what's great is those exercises also have their answers. So you can click an easy drop-down menu and switch to what's called a different branch, the solution branch, that has those same exercises with all of the answers to them. So if you're someone who learns more by seeing an example or by taking something that works and augmenting each little piece of it,
00:26:48
Speaker
that's what I would do there. Great. Well, congrats to you and to Joel. The new book just came out about, what, a little over a month ago, at the end of December. Yeah, it's packed full. I recommend everybody check it out. Mike, thanks for coming on the show. This has been really interesting. Thank you so much for having me, always a joy to chat with you. And I appreciate the questions and promotion of the book. Yeah, yeah, my pleasure. Thanks a lot. All right. I hope it helps you learn R in the next couple weeks.
00:27:15
Speaker
I'm working on it. I'm working on it. Now I've got, now I've got more to do here, but yeah, I've got more to do. All right. Thanks man.
00:27:26
Speaker
Thanks everyone for tuning in to this week's episode. I hope you enjoyed that. I hope you learned a lot. And I hope you'll check out Mike and Joel's new book. The link is on the episode show notes. You should really check it out. I've been making my way through it as I'm making my way through a few other R books as I'm trying to become better at R myself. So I hope you enjoyed this week's episode. So until next time, this has been the PolicyViz Podcast. Thanks so much for listening.