Episode #69: Hadley Wickham

The PolicyViz Podcast

Hi everyone! Happy New Year and welcome back to the PolicyViz Podcast! I hope you had a relaxing holiday season. To kick off 2017, I’m excited to welcome Hadley Wickham to the show to talk about his work with the...

The post Episode #69: Hadley Wickham appeared first on PolicyViz.

Transcript

Introduction to JMP Software

00:00:00
Speaker
This episode of the PolicyViz podcast is brought to you by JMP, Statistical Discovery Software from SAS. JMP, spelled J-M-P, is an easy to use tool that connects powerful analytics with interactive graphics. The drag and drop interface of JMP enables quick exploration of data to identify patterns, interactions, and outliers.
00:00:19
Speaker
JMP has a scripting language for reproducibility and interfacing with R. Click on this episode's sponsored link to receive a free info kit that includes an interview with DataVis experts Kaiser Fung and Alberto Cairo. In the interview, they discuss information gathering, analysis, and communicating results.

Interview with Hadley Wickham

00:00:49
Speaker
Welcome back to the PolicyViz Podcast. I'm your host, Jon Schwabish. On this week's episode, I'm very excited to welcome Hadley Wickham, creator of many fantastic R packages, currently chief data scientist at RStudio, or just chief scientist? Not even specific to data, just science. Great. Hadley, thanks for coming on the show. I appreciate it.
00:01:12
Speaker
Thanks for having me. I'm very excited now. I think anyone who's worked in data visualization, data analytics, or just worked with data knows about you and the work you've done with R, but you imminently have a book coming out from O'Reilly Press, R for Data Science, that you were co-authoring with Garrett Grolemund, who unfortunately couldn't make it today.
00:01:33
Speaker
I want to dive in and talk about the book and get your thoughts on it and what the process was like of writing a book about coding. But maybe first you can tell people a little about yourself, your background, and maybe some of the R development you're currently working on.

Hadley's Journey to RStudio

00:01:48
Speaker
Sure, so I guess I still self-identify as a statistician. I have a PhD in statistics from Iowa State. After I got that, I was an assistant professor of statistics at Rice University for a few years, and then recently I've been full-time at RStudio as chief scientist.
00:02:06
Speaker
So basically what I'm interested in is all the stuff that you need to do in order to do statistics that's traditionally been outside the realm of what statisticians study. So getting your data into R or whatever, getting it into a useful form, transforming it, munging it, wrangling it, visualizing it, all that kind of stuff. So I'm interested in the whole process of data analysis and data science from the very beginning to the very end.
00:02:36
Speaker
I think I'm most famous for ggplot2, but a lot of my more recent work's been on just making it easier to work with data, so you can get it to the point where you can actually visualize it.
00:02:45
Speaker
Great. So what packages were you using prior to getting really involved in R? What were the statistics packages and languages that you were primarily using? And where did you see that they didn't do all the things that you needed them to do? To be honest, R really was my first statistical language, basically, because I did my undergraduate at the University of Auckland. That's the home of R. So even
00:03:09
Speaker
back when I was an undergrad, even the fairly low-level courses at Auckland used R. I also learned a little bit of SAS then
00:03:18
Speaker
as well, but it was kind of clear to me even then that R was the way to go. So really, R has been in your DNA since you were a young lad. So when you think about or when you talk to people who are using some of the other packages like SAS and Stata and SPSS, where do you find that they're talking about the pressure points or the points where they can't do things that you think R helps them do better?

Why Choose R?

00:03:43
Speaker
Well, I mean, I think one thing that's not exactly, well, it's definitely a pressure point, and that's the cost of these things. You know, R is free and open source, and these other tools are getting progressively more and more expensive. To me, like the power of R is kind of related to that open source-ness.
00:04:02
Speaker
I think when you're picking a programming environment or an environment for doing statistics, what you really want to look for as much as anything is the community. Is there a community of people like you working in that environment and sharing their code? Because R is open source and it has this great package ecosystem, it makes it so easy for you to share your tools with other people and for you to use other people's tools.
00:04:27
Speaker
And so I think that's what, you know, as well as obviously like a really powerful statistical language, it's got great visualization, great modeling, but just the fact that you can also share all that kind of special purpose stuff with other people in your community is really, really powerful.
00:04:42
Speaker
Do you see a flip side to that? Are there concerns about the quality of the packages? I know a lot of federal government agencies here in DC are concerned about security behind it. What's the flip side for people? If you and you are an R evangelist, how do you address those concerns and people who have that frustration or those thoughts?
00:05:04
Speaker
Yeah, I think that's definitely a legitimate concern. But to me, with statistics in particular, there are so many ways to make mistakes that are unrelated to the implementation. If you have no idea what you're doing
00:05:17
Speaker
and you have no way to interpret the results and say, hey, do these make sense or not, then you're in a very, very dangerous position anyway. You can take something that has absolutely no bugs in it and still apply it inappropriately. So to me, I think you have to think a little more broadly rather than just trusting, well, just because this
00:05:36
Speaker
tool is produced by a commercial company who says it's, you know, wonderful and great. I think you really have to think a little bit more broadly and say, hey, how can I make sure that when I'm doing a statistical analysis, I am doing the right thing and it's doing what I expect it to do? I think a lot of that's tied up with reproducible research tools and stuff like that, so you can audit exactly what you've done and then talk it over

Accessibility vs. Misuse in Statistics

00:06:01
Speaker
and
00:06:01
Speaker
Do you think it's dangerous for any statistics package to make it so easy to do statistical analysis where, you know, in Stata, for example, you just type regress y on x and, you know, you're on your way? Is it dangerous that it lets anybody run a regression when they don't really know what they're doing? Or is it a good thing to let people do it so easily?
00:06:21
Speaker
So basically, I think it's a good thing. I've actually written a little paper on this called Practicing Safe Stats, with the idea that, you know, traditionally statisticians have preached the abstinence-based approach. You know, you should never do statistics unless you're in a committed long-term relationship with a statistician.
00:06:42
Speaker
But you know the problem is you kind of see other people, you see your friends doing statistics and they're having a great time and so you think oh well I'll try it out and you have a great time and nothing seems to go wrong. So I think rather than trying to preach abstinence and trying to stop people doing it unless they're in the
00:06:57
Speaker
company of an expert. I generally think it's much better to think about how can we enable people not only to do this stuff but then to critique what they come up with. It's not just about really complicated statistical algorithms. It's also about giving people simple tools to help them activate their innate skepticism and their innate sense and really critique what they get out of the statistical toolkit.
00:07:25
Speaker
And do you think that extends to some of these other portions of the pipeline of working with data when it comes to design for visualizations, when it comes to creating drop and drag interfaces for interactivity? I mean, do you see a place where things are maybe too easy to do and it lets people who may not have the skill sets get in? Or is it all just good if we're making tools that are just as easy for people to use as possible? That's all good.
00:07:52
Speaker
I think it's all good, basically. I think you just have to accept that people are going to screw up and make mistakes, but that's fine. Sometimes the only way to learn is to make a mistake, and there's just too much data and too much need to analyze it to say, well, you know, if you want to be able to do statistics, you'd better go away and do a two-year master's program. That just cannot work. We just have to enable people and give them the tools to be successful.
00:08:19
Speaker
And when you're building your packages, when you built ggplot and ggplot2, there are, I assume, and I've seen some of this, that you've built some things that you sort of let people do and don't let people do. I think dual vertical axes is one that I was reading about the other day. So how do you balance those decisions when you're developing?
00:08:39
Speaker
You know, there are some things that I think are just a terrifically bad idea, and dual y-axes are one of those. Now, interestingly, the next version of ggplot2 will have a specific type of double y-axis, and that's where one axis is a one-to-one transformation of the other. So you can still have one axis that's Celsius and the other that's Fahrenheit if you want. So there are things that I absolutely think are a bad idea and I'll never support.
00:09:07
Speaker
But there's also a whole lot of stuff where I'm kind of like, well, you know, that's fine. I just don't care enough about it to do it myself. So if you want to do it, go ahead. But generally, one of the things I do spend a fair amount of time thinking about is how I can steer people down the right path without locking them into that
00:09:31
Speaker
destination. To me, I want the correct thing to be easy to do, but you're still free to do other things that maybe I don't think are correct, or maybe are not correct all the time, but might be the right thing to do for your case. I mean, there's a lot of gray in this field, right? Exactly. And again, coming back to SPSS and Stata, I think one of the great powers of R is that, because it's a programming language,
00:09:58
Speaker
and this is both its strength and its weakness, it doesn't lock you into any preconceived analysis. You can do anything you can imagine. And the downside of that is, you know, when you open it up, there's just this blinking cursor staring at you. There's no hint at what you should do, but you can do whatever you can imagine, for better or for worse.
00:10:17
Speaker
And do you think that the ability of people who are using R to use this entire workflow, from pulling in the data and cleaning it and processing it all the way through some of these other packages like Slidify, and creating presentations or creating interactive visualizations, is that another one of the pieces of R that sets it apart from the standard or the classic statistical packages?
00:10:40
Speaker
Yeah, absolutely. I have a great story from my friend Hilary Parker when she worked at Etsy that she was giving a presentation and sort of like five minutes into the presentation someone said, oh hey, we updated that database like last night. So you were giving us like a presentation about data that is no longer accurate or correct.
00:11:01
Speaker
She just quit out of it, pressed Command-Shift-K to re-knit the whole document, and then went back to presenting. That just blew the minds of the people in the audience. These are software engineers who were pretty seriously impressed by that whole reproducible workflow. I think that's just such a powerful thing because the data is always changing. Even if you think the data is not changing, it always turns out that it is. There's always something new that's going on.
00:11:31
Speaker
Great, so I want to talk a little bit about the book. Can you maybe give us an overview of what you hope to do in the book and what you hope people learn from it?

For Beginners: R and Data Science

00:11:42
Speaker
So basically the goal of the book is that if you have never used R before or if you've never done data science before, you can read this book and get basically
00:11:50
Speaker
all of the tools you need to do good data science. Now, I'm not sure that aim has completely succeeded if you're absolutely new to R and absolutely new to data science. It's probably going to be tough just to learn that from a book, but that's very much the goal. Certainly, if you know a little bit about R, if you know a little bit about data science, this should get you up and running as quickly as possible.
00:12:13
Speaker
And the structure of the book, does it follow a syllabus that you've used in classes or in workshops? I mean, how was that development process, and what's the philosophy that you and Garrett have behind learning to code? Yeah, so I think it's really steered by our experiences teaching this very similar material at Rice University several years ago. And to me, the most important thing
00:12:38
Speaker
with teaching anything, really, is to make sure the motivation is up front and central to at least the early parts of the course, because I'm not going to lie and tell you that learning R is going to be this beautiful, pain-free experience where you're just smiling and laughing the whole time. There are absolutely going to be periods where you're so frustrated that you're banging your head against the desk.
00:13:02
Speaker
The goal of the book is to make sure we get to some really cool stuff as quickly as possible. That's why the book, anytime I teach data science, I always start with visualization. Just because it's like this neat immediate payoff, it lets you do something really easily that was really hard or impossible before. That gives you that motivation. You immediately see the light at the end of the tunnel and later on when you go through those dark patches.
00:13:29
Speaker
You've got that motivation to keep pushing through. Right, right. What was the process like of writing a book like this? I mean, it's a tome. What I've seen, it's a tome, right? So what was the process of sitting down with Garrett and sort of putting all this out that you've taught and is in your head, presumably, and just getting it on paper? What was that process like?
00:13:49
Speaker
Yeah, so I mean, one of the things that I am absolutely sold on for writing books now is writing them in the open. So basically, the book is all R Markdown files on GitHub, and we have a script so that every time we change one of those files, it automatically rebuilds the website. So everything was in the open. And to me, that is just so important. Because if you don't do that, if you're just sitting
00:14:16
Speaker
you know, in your little room beavering away on the book for months, it's just so hard to keep up motivation. And so I love working in public because people, you know, read it. And even if they read it and say, wow, this is crap, they still read it and you're like, wow, okay, I should make it better. So just that constant stream of feedback is so valuable. So what was the feedback like? How was the interaction with people who were using it and testing it out?
00:14:42
Speaker
Overall, very positive. A lot of the interaction happened via Twitter. The other thing that's really neat about the process for the book is that you can actually, if you're reading it online, there's an Edit button on each page. You can click on that button and that will allow you to edit the text of the book and then that's a pull request to GitHub, which I can then review and incorporate into the book. So it's really grown and overcome at all. Exactly.
00:15:11
Speaker
I don't know how many contributors we have in total. I should take a quick look. Yeah, I mean, there's definitely tens, maybe a hundred contributors all up. So, you know, some people would just do drive-by type fixes. Some people have clearly really systematically read through the whole book and pointed out lots of problems. The R community has just been fantastic in terms of helping me improve the book.
00:15:36
Speaker
And in fact, I guess I'd been sort of in denial of the power of factors for a long time, thinking that you could do anything with strings, and my friend Jenny Bryan finally persuaded me that there were things you just could not do with strings and factors were the right solution. And that led to the creation of the forcats package, which is a package for categorical variables,
00:16:02
Speaker
and it's also an anagram of factors, that makes using factors much easier. And so that, you know, ended up being a chapter in the book. Right, right. So the interaction with contributors, did that serve as the technical review and the quality control process? Or did that have its own separate track at the very end?
00:16:22
Speaker
Yeah, so O'Reilly also did a technical review track, which turned out to include a lot of people who were already reading the book. Yeah, yeah. I don't know, that was useful. But I think it's useful to have a few people who feel compelled to read the whole book and give feedback. Yeah.
00:16:38
Speaker
Overall, I think that was, I don't know, I'd say the community feedback throughout the writing process was at least as useful as the final formal review. Right. How does this book fill an existing gap in the book landscape for books that are teaching R, how to code in R?
00:16:57
Speaker
I think it's fairly unusual in that, firstly, it's explicitly about data science. It's not about, you know, statistics. It's not about forestry statistics or fishery statistics or any of these many specialized subdomains. This is broadly about the common 80% of tools that everyone who is using R to do data analysis will use, regardless of what domain they're using it for.
00:17:24
Speaker
So I think that's unusual. The other thing is that this book really kind of brought together, at least in my mind, this sort of idea of a tidyverse, which is a network of packages, an ecosystem of packages that are all designed to work together that kind of share common underlying philosophies. So that once you've learned one, you've got a heads up on many others because they share so many idioms. Do you want to talk a little bit about the tidyverse package and some of the other R development you're working on?

Understanding the Tidyverse

00:17:53
Speaker
Yeah, so the tidyverse is this idea of an ecosystem of packages that all work really well together, and the thing that's enabled me to work on that is basically the name tidyverse, because for a long time I think the best name for that collection of packages
00:18:09
Speaker
was the Hadleyverse, which is a term I just absolutely cannot stand. I think it's okay for other people to use it, but for me to use it sounds so egotistical. Amazingly, we went through and brainstormed like 20 names before we figured out tidyverse. Tidyverse just seems so obvious in retrospect. But having a name has really helped me think about, well, not just these individual packages, but how they fit together.
00:18:38
Speaker
And so one immediate outcome of that is the creation of this tidyverse package, which allows you to do two things. First of all, when you install the tidyverse package, it basically installs every single package in the tidyverse. So you only need to run install.packages("tidyverse") to get everything that you'll need in most cases. And then when you load the tidyverse package, when you do library(tidyverse), it loads the packages that I consider to be the core part of the tidyverse. So ggplot2, dplyr,
00:19:08
Speaker
tidyr, purrr, and something else I'm probably forgetting. But basically, now instead of having to litter every script with the same packages, you can just use this. I'm also hopeful it will help people learn about packages they didn't know about. I see a lot of people still using reshape or reshape2, which are now
00:19:30
Speaker
pretty ancient. And just generally thinking about the tidyverse, thinking about the packages, how can you make it easy for people to discover them, to learn them, to use them, to find out about new stuff? So I'm also working a lot on package websites. It's very, very primitive currently, but you can go to tidyverse.org and see a very,
00:19:53
Speaker
very simple HTML page, which just lists the tidyverse websites that are up and running so far. And then I've also been working on this package called pkgdown, which is designed to make it as easy as possible to make a nice website from a package. So I'm putting a lot of time in the coming months into making sure all of these packages in the tidyverse have a good home on the web and a good place to start if you want to learn them.
00:20:19
Speaker
Yeah, so it sounds like what you're sort of creating is a pipeline that takes it all the way from getting some data all the way, not just to analyzing it and visualizing and getting those visualizations out, but actually being able to communicate them on a website or through a presentation.
00:20:34
Speaker
Exactly. The thing that doesn't quite fit in the tidyverse but is absolutely an important part of the data science process is that final communication step, whether it's writing an R Markdown report or presentation or whatever, or creating an interactive Shiny app. Those don't quite follow the tidyverse because you interact with them in somewhat different ways, but they're an absolutely, incredibly important part of the data science process.
00:21:01
Speaker
Great. Hadley, thanks for coming on the show. This has been really interesting and I'm sure people are excited for the book, R for Data Science, which is coming out very soon from O'Reilly, by Hadley Wickham and Garrett Grolemund. Hadley, again, thanks for coming on the show. Thanks for having me. Thanks to everyone for tuning in this week. I hope you enjoyed this week's episode. Until next time, this has been the PolicyViz Podcast. Thanks again for listening.