
Episode #150: Learning R

The PolicyViz Podcast

Welcome back to this, the 150th episode of the PolicyViz Podcast! With four years of episodes (almost to the day!), I’ve had a lot of fun bringing you insights from people all across the globe doing amazing work in the...

The post Episode #150: Learning R appeared first on PolicyViz.

Transcript

Celebrating 150 Episodes

00:00:11
Speaker
Welcome back to the PolicyViz Podcast. I'm your host, Jon Schwabish, and welcome to the 150th episode of the show. On this week's episode, I'm going to flip things around a little bit to mark the 150th episode. And no, I'm not gonna talk about all the things I've learned and the great people I've spoken to for 90 minutes. If you've looked at your podcast player, you'll note that this episode is quite a bit longer than the usual 25 minutes or so that I aim for.

Special Episode Format

00:00:37
Speaker
On this week's episode, I'm going to flip the script around a little bit and I'm going to be interviewed along with a colleague of mine at the Urban Institute. So let me set the stage for you here.

Learning R with Aaron Williams

00:00:47
Speaker
Back in December of last year, I decided I was finally going to sit down and learn the R programming language. I had tried some online courses.
00:00:54
Speaker
I had tried some open online courses, most recently with Andrew Tran from the Washington Post back in the summertime, and, probably like a lot of people's experience, I got a week or two in and then work would take over or life would take over and I just wouldn't be able to finish the course.
00:01:10
Speaker
So I reached out to a colleague of mine, Aaron Williams at the Urban Institute, who runs our R users group and does a lot of development with R, has created some R templates for the Institute. And I reached out to Aaron and I said, look, I really need to learn R. Can you sit down with me for two days straight?
00:01:28
Speaker
just walk me through the whole thing. I want to learn how to read data and write data. I want to learn how to reshape data, make new variables, make manipulations to variables. I, of course, want to learn how to make data visualizations. And here's a list of graphs I want to make. And fortunately,

Two-Day R Learning Sprint

00:01:44
Speaker
Aaron agreed. And so we sat down for two days to conduct what we ended up calling a learning sprint.
00:01:49
Speaker
and so we sat down in my office and we used some of the built-in datasets in R, and then I also had a hockey dataset that I had just had sort of sitting on my desktop, and we ended up playing around with that a little bit
00:02:06
Speaker
doing some analysis of the data, making some graphs, and sort of making our way through the data set. And then I posted a little bit of a short Twitter thread describing what I felt were the advantages of that learning sprint environment. And Ben Stenhaug from Stanford University reached out and wanted to learn more.

Interview with Ben Stenhaug

00:02:25
Speaker
And so Ben offered to interview Aaron and myself for a podcast. And so we sat down, the three of us, and talked for what turned out to be
00:02:36
Speaker
over an hour about the advantages of the learning sprint, how I approached it, how Aaron approached it, how we both learned different aspects of the tool: Aaron learning, I think, how to develop some of his teaching approaches; I, of course, learning how to use the tool. So it's a really interesting environment and one that may not be scalable, sort of in a broad sense, but certainly one that was valuable for me to learn the

Episode Announcement and Resources

00:03:03
Speaker
tool.
00:03:03
Speaker
So for this week's episode, to sort of mark the 150th episode of the show, this interview is sort of flipped around. Ben Stenhaug from Stanford University interviews Aaron Williams, a colleague of mine at the Urban Institute, and myself to talk about this R learning sprint. So it's a little bit of a longer episode, but I do hope you'll enjoy it and check out the resources for R and for learning R on the episode notes, on the show notes page.
00:03:31
Speaker
So here we go, the 150th episode of the PolicyViz podcast on learning R.

Introductions: Ben and Aaron

00:03:41
Speaker
Hi, I'm Ben Stenhaug, and I'm a PhD student in the School of Education here at Stanford involved in some of the data science things going on around campus. And I'm really excited about this, guys. How we got here was I saw Jon's tweet about what I think we're calling a learning sprint with an expert, which was his most recent attempt to learn R. And I've just become super fascinated with how people learn data science and learn
00:04:07
Speaker
R in particular, because I see a lot of people using really effective ways and getting really good at R. And I see a lot of people struggling, trying things like online courses or books or what have you. And either they aren't able to stick to that or they go through it and the content doesn't stick to them. So I thought this was a really interesting idea. I reached out to Jon after the tweet. He was really kind and was like, hey, let's have a conversation. And then in this day and age, it's like, well, if we're going to have a conversation, we might as well record it.
00:04:35
Speaker
And so here we are doing that. That's my background.
00:04:38
Speaker
Yeah, so I guess I'll start. So I'm Jon Schwabish. I'm a senior fellow at the Urban Institute here in Washington, DC. Urban is a nonprofit research institution. I do economic research here on areas like food stamps and disability insurance. And then I'm also part of the Urban communications team, where I help colleagues here work on their data visualizations and their presentation slides and their presentation skills. And I also have my own website called PolicyViz and host my own podcast where
00:05:05
Speaker
This is also being played at policyviz.com and the PolicyViz Podcast. Yeah, and hi, I'm Aaron Williams. I'm a data scientist in the Income and Benefits Policy Center at the Urban Institute. I mostly do research on retirement policy, also some tax policy work, and I lead the Urban Institute's R users group. And we do a lot of different things, but one of those things is we teach people at the Urban Institute how to better use R to improve their research.
00:05:31
Speaker
Awesome. So, okay, this is great. Let's dive in. So, Jon, I'm super curious. How did you get to the point of wanting to learn R so badly and maybe desperately that you're like, let me find a colleague that knows R and take two days off out of your busy life and sit with them in what must have been a fun but also sort of anxious couple of days?

The Growing Importance of R

00:05:54
Speaker
Like, how did you get to that point, man?
00:05:56
Speaker
Yeah, that's a good way to kick it off. So I'll try to keep the story short. So most of my data visualization work is done in Excel. Now I have a programming background. I've for a long time used Stata and SAS and also Fortran and did C++ back in the day. But I never really jumped into R and
00:06:16
Speaker
You know, creating things in Excel has always been my workflow. It's also part of Urban's workflow here, where we put things from Excel into Word and that's generally how reports are created. So it sort of made sense for me to continue that workflow for most of my time here at Urban. And of course, R over the last few years has really taken off,
00:06:34
Speaker
especially on the data visualization front. And from my perspective, the data visualization capabilities of tools like Stata and SAS just lag. They're just not very good. They're not very customizable. And so I never really got into those tools. I do what I think a lot of other economists do, which is we do our analysis in Stata and SAS or SPSS or MATLAB or whatever it is. We find the graph or two or three that we want to actually put in the report.
00:07:03
Speaker
we winnow the data down to get that part of the things that we need to graph, and then we pipe that into Excel and we make a graph. But because R has sort of exploded, we have been sort of playing on the edges of learning it. And like any coding tool, it takes a lot of time, takes a lot of effort. So I've taken, I think,
00:07:22
Speaker
maybe two or three online classes, or I should say I started taking two or three online classes. And I think my most recent attempt at this was Andrew Tran's MOOC with the Knight Foundation that was I think taking place in maybe July and August of 2018.
00:07:43
Speaker
And so I signed up, you know, and I mean, I really love Andrew's work at the post and, you know, did the first week, you know, and I was like, okay, I'm going to carve out like, you know, whatever he recommended. It was like, you know, four to six hours a week or something like that. So you watch the videos, you go through and sort of work through and did that for two weeks. And then, you know, by the third week, you know, life sort of took over and I just, you know, just couldn't keep it up.
00:08:05
Speaker
So Aaron, he'll talk about in a bit, I'm sure, has sort of led Urban's efforts to have a lot of people using R. And he's coordinated the R users group, which he'll talk about and a lot of other things, a lot of other templates and tools and packages and libraries and everything, you know, people need just to be able to be good R users. And so the number of people who are using R here has really exploded. So
00:08:29
Speaker
I reached out to Aaron probably like early November. I was like, look, I really, at this point, like I want to learn R. I know it's something I need to know. Like I don't need to be Hadley Wickham level of R user, but I need to be able to make some things and I just can't do it with these online courses. So can we sit down? I can cover your time, but can we sit down and like just hack through a whole bunch of things?

Learning Approach: Hands-On Practice

00:08:52
Speaker
And fortunately he was like, yes, absolutely. Let's do it right now. And so we found a couple of days in early December.
00:09:01
Speaker
And I had sent him a list of things that I wanted to learn, things that I use every day in Stata that, you know, I think for listeners who know R probably seem obviously trivial and easy, but when you're first learning it, they need to be known. So things like reading data and exporting data and reading CSVs and Excel files and Stata files, like how do you read those in and how do you export them? How do you set a working directory? And things I had sort of like nibbled around the edges of trying with these online courses, but
00:09:31
Speaker
There's something about the way that those courses work for me at least that the learning never really took hold. And I think the, sorry, I meant for this to go short and I've gone long, but I'll just say like two more things. I think one of the things about those online courses is that they provide you with the code.
00:09:50
Speaker
and all the instructors are like, don't copy and paste the code, like actually write it. But it's so tempting to just be like, oh, I'm stuck. I'm just gonna copy and paste the code. And so what I needed was someone just for my learning style. I needed someone to sit down with me and just show me step by step.
00:10:06
Speaker
And especially as you're going through, you know, you read in data, you reshape the data, you know, there's some little thing that, you know, doesn't quite work. And you want to drop a column or add a column or something like that. There's all these little things about coding that I just wasn't getting from these online courses. And so sitting down with Aaron for two days really helped me get over a bunch of those humps to be able to, you know, start creating some things. Okay, so I'm going to stop there, because that was way longer than I intended.

Exploring Hockey Data

00:10:31
Speaker
But
00:10:31
Speaker
No, that was awesome. I think it's really interesting and it's an open question. You say, oh, those online courses didn't work for my learning style because I would copy and paste the code or it wouldn't stick or whatever. I think I would hypothesize that it's not just you. I think it's a very common online learning experience for coding and it's not like that's just, I think everyone is experiencing that. That's one reason I was really interested.
00:10:56
Speaker
Just to that point, I think, you know, when you think about college, for example, you have the lecture with the professor, and then if you get stuck, you can go see the professor in office hours or whatever and sit down and be like, I don't get this little thing. You can do that in these online classes, but you know, it's not the same. You know, the discussion forums where you post a question and they say, oh, try this or try that, or it's just not the same as having someone sitting down next to you.
00:11:20
Speaker
saying, look, if you type it this way, or, you know, let me grab your computer for five seconds and, you know, type this line of code in for you so you can see it. It's just a different, different environment. Right. The answer you didn't give to why you wanted to learn R: I thought you were going to, I thought we were going to get all therapeutic and you were going to reveal some, like, deep regret from having Hadley on the podcast but not knowing R, and you were finally sort of letting go of that guilt in some way.
00:11:46
Speaker
I mean, there is a little bit of that because people keep sending me books for R. And they're like, oh, check out my new book on R. And I'm like, yeah, it looks great. I'm going to put it right on the shelf because I don't do this. But thanks for the book. So there is definitely a little bit of guilt. And also, a lot of times when I'm teaching, one of the big questions when I teach data visualization is,
00:12:07
Speaker
What tools should I use? And that's a really hard question to answer because it depends on the individual's background and their skill set and their organization and how that workflow works and what ultimately they want to create. And so R is clearly one of the things that's in the library of tools that people should at least consider.
00:12:27
Speaker
And I feel like, personally, JavaScript, which is more object oriented, is just far enough away from the way that my brain works at this point that it's just not something that I'm going to be able to really pick up and do. But R is, you know, in that class of the statistical programs, right? It's akin to Stata and SAS. I mean, there are even cheat sheets for how to go from, like, you know, a Stata command to an R command. And so they are sort of close in that sense. And so I felt like, for me, slaying the dragon of learning R
00:12:57
Speaker
was a smaller dragon than slaying the dragon of learning something like D3, which is just not gonna happen for me. Got it. And so, I mean, I really want to get into the nitty-gritty here. So Aaron, tell me, where were you when Jon reached out? Like, were you mid coffee sip? Was it an email? What was your reaction? Were you like, this is crazy, I can't do this? Or, oh, I'd love to do this? What was it like for you?
00:13:20
Speaker
Yeah, I'll back up a couple of years to answer that question and say I learned R through the data science courses on Coursera taught by Roger Peng and Jeff Leek and Brian Caffo, starting in 2015. And for like the first year and a half that I was an R programmer, I didn't know a single other R programmer. And so when I moved to DC and came to the Urban Institute, I was just really excited anytime I met anyone who cared about R,
00:13:46
Speaker
because I finally felt like there was someone to share my enthusiasm with. So when Jon sent me the email that he wanted to sit down and talk about R for two days, I was like, this is my dream. Let's do it. How soon

Coding Sessions and Problem Solving

00:13:58
Speaker
can we do it? Does it have to be just two days?
00:14:01
Speaker
Like the heavens opened and Jon came down wanting R help. Exactly. I just, I really have a passion for sharing R, and so anytime anyone is curious about it, it's really motivating and energizing. Awesome. And so then where were you when Jon reached out? Yeah, I guess I was in my office. He emailed me on my work email. And I think
00:14:21
Speaker
I went to peer into his office and he wasn't there that day, and I was a little disappointed that I couldn't, you know, share my ideas face to face. Yeah. Yeah. But your reaction was just like, I'm in. Absolutely. Let's do this. Oh, I'm definitely in. And a big part of my job here is dealing with people exactly like Jon who, you know, are experienced in SAS or experienced in Stata and they want to do something they can't do in one of those packages, and they reach out to me and they say, so tell me, you know, about this R thing,
00:14:50
Speaker
and maybe help me get to a point where I can use it to improve my research. And I think that's one of the most motivating parts of my job. And was Jon a little bit different in that he wanted to fully learn the tool? So I do some sort of consulting for students at Stanford, and I always have this difficulty where people come in and they're like, I want to do this thing.
00:15:09
Speaker
And then it's, oh, well, there's a day and a half of prereqs for you to really understand what's happening. Like, I can write the code to make you do it right now, but I'm not teaching you to fish, right? And so it must be fun to have Jon wanting to learn to fish. Yeah, I think that was the thing that stood out as different about this: he wanted to commit two days to it and really, I think, understand sort of the why instead of just, like, take me from point A to point B. And obviously there are lots of people with limited time who say, I want to learn R, and then they never get around to it. And so
00:15:39
Speaker
having the ability to sit down for that amount of time in a short period of time definitely was different and, I think, successful. The other thing I'll say is that Aaron has started this R users group, which he can talk more about, that meets every week. And, you know, I've only been able to attend a couple of them just for, you know, various scheduling conflicts. But one of the things about attending a group like that is, I think, and I don't think this is uncommon either, that, you know, you feel a little
00:16:07
Speaker
shy, a little embarrassed, maybe, if you don't have the skills. And it's not like, you know, I subscribe to the belief that there are no dumb questions. But when you're in a room of, you know, even 10 or 15 people, when you're asking these questions of, like, okay, how do I reshape my data, right? Like, in that room the variance in skill sets is gonna be pretty wide. And so even though, you know, Aaron set up these weekly meetings,
00:16:31
Speaker
Even there, it was not a place that I was like, I need to get halfway up the steep learning curve. Once I get halfway up the learning curve or some percentage up that steep initial slope, then I can start going to these meetings because now the questions I feel are not so elementary, and now I can start leaning on other people in the organization to help me with some of these key questions.
00:16:54
Speaker
So we meet every Friday for an hour. It's called the R Lunch Lab. And it's usually 12 to 15 people that are interested in R programming. And Jon hit it exactly. One of the biggest challenges is how do you
00:17:05
Speaker
get people together in a room for an hour if they have such a wide range of abilities. We do different things to try to address that, and I hope Jon starts coming now that he has that experience. It's so interesting you talk about needing to get halfway up the learning curve. It almost reminds you of someone that wants to get in shape, and so they want to start going to the gym, but they feel so out of shape that they're embarrassed to go to the gym, so they work out at home for a while.
00:17:32
Speaker
And then once they're good enough at home, they're able to go to the gym, and then they get really good, which seems, it's just, I don't know. That's a whole nother topic, but how you make those environments, like user groups,
00:17:46
Speaker
actually feel safe for people at any skill level, because it's, I don't know, it feels like we have such strong systems that say like, hey, people in this room know how to reshape data, don't ask them how to reshape data, despite the leader being like, there's no dumb questions, everyone's really kind, we just interpret things in sort of a scary way when we don't know things sometimes, I think. I mean, there's a whole like, it's average Joe's gym, right? Like, there's a whole thing about like, you know, you want to go to the gym and like, you know, it's a judgment free zone. And
00:18:15
Speaker
And that may totally be, right? It may just be my personal perspective, right? It's like the thing I have to work out on my own that no one's judging. But that's, you know, but that's, you know, just, I think that's just human nature, where, you know, you don't want to sort of hold the group back. But to get back to how we actually did it. So I have a little roundtable here in my office, and
00:18:38
Speaker
I had my computer, everything, you know. Like I said, I gave Aaron a list, and I know enough to update R and update RStudio. And so I was all set to go. And I actually had a project, a data project. So the plan was actually, maybe, Aaron, did we figure this out the day of? I think you had set up using some, like, core R data. And then I said, oh, I have another data set that we can play around with.
00:19:07
Speaker
Yeah, which, you know, one of the great things about R is there's built-in data that are really easy to access, so you can start showing people concepts really quickly. But I think Jon wanting to use his own data set ended up being really good for both of us, because it forced me to slow down. Because instead of being like, oh, yeah, this is the diamonds data set, I know this, this, this, and this, like, I didn't know it going into it. And so
00:19:28
Speaker
I kind of was approaching it at the speed of an analyst using R to answer questions that they don't know the answers to. And so we decided that day to use this NHL data set that Jon had acquired. Right. So yeah, so we sat down at my little table here with my computer. And then Aaron also brought this little booklet. Well, he can explain it better than I can, but it basically has all the cheat sheets from RStudio sort of stitched together. And so we were able to sort of walk through all the primary parts that I needed
00:19:57
Speaker
Um, for my list, right? So I wanted to learn how to read and write data. I wanted to reshape data. I wanted to create new variables, you know, delete columns. I wanted to learn how to merge data. And then ultimately I think like day two, we got into, into ggplot and started making graphs.
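As a rough illustration of the first items on that list (writing data, reading it back in, and merging), here is a minimal R sketch. The toy `players` and `teams` tables are hypothetical stand-ins, not the hockey data discussed in the episode, and the choice of `readr` and `dplyr` functions is an assumption about what a tidyverse-style session might use rather than a record of what was typed that day.

```r
library(readr)  # read_csv(), write_csv()
library(dplyr)  # left_join()

# Hypothetical toy data (not the NHL data set from the episode)
players <- data.frame(
  player = c("A", "B", "C"),
  team   = c("WSH", "PIT", "WSH"),
  goals  = c(30, 25, 12)
)
teams <- data.frame(
  team = c("WSH", "PIT"),
  city = c("Washington", "Pittsburgh")
)

# Export to a CSV file (a temporary file here), then read it back in
path <- tempfile(fileext = ".csv")
write_csv(players, path)
players_in <- read_csv(path, show_col_types = FALSE)

# Merge: keep every player row and attach the matching team's city
merged <- left_join(players_in, teams, by = "team")
merged
```

For the other file types mentioned in the conversation, `readxl::read_excel()` reads Excel files and `haven::read_dta()` reads Stata files; both return data frames that can be used the same way.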
00:20:11
Speaker
Um, and so that resource and, you know, also again, it's, you know, these things, they sound, I don't know, they sound elementary, but they also like seem to me also in retrospect, like obvious, like you have a cheat sheet with, I don't know how many pages are in there, but you know, it's like 10, 12 pages.
00:20:28
Speaker
You know, where do you look to find the snippets of code that you need to do the various things you can do with reshape? Like, it's not entirely obvious. You would have to, you know, as an individual, go through and basically read each line of each page. Whereas with Aaron sitting next to me, he can say, look, we're going to talk about, you know, reading in the data. And so here are the two pages on dplyr, right? And so here's where you could look, or we're going to do, um,
00:20:52
Speaker
We're going to do the graphs. Here's the two pages on ggplot. We're going to see if you can test my knowledge here. We're going to do dates, so we can look at the lubridate package. Yeah, see,

Reflections on Learning Sprint

00:21:02
Speaker
I'm learning. I'm learning. Yeah, so those sorts of resources that he brought as well, and also things that we looked at on the web, I think were really helpful, again, to point directly to certain things.
00:21:15
Speaker
Speaking of those resources, so Aaron, how did you prepare? You have this list, you sort of know the goals, what are you doing for preparation? Yeah, so a lot less than I would have done in the past. So a year and a half ago, I did a full intro to R series of six one-hour sessions for the Urban Institute. And I prepared a bunch for these conversations. I had slides, I knew everything I was going to say, and I did a specific section on functions.
00:21:45
Speaker
And I got to the end, no one was with me, no one had understood what I had said, but then people lined up, you know, and they all wanted to ask me these questions one on one. And what I realized is, you know, I can answer 90% of those questions without any preparation. So that really shaped my philosophy for teaching R. I think I have a lot of resources that I can reference, and then I just try to meet the person where they are. And the kind of biggest things I want the person to leave the room with, and I think Jon was hinting at this, or saying explicitly, is,
00:22:16
Speaker
You know, if I don't know the answer, I want to show them how I find the answer, right? I want to show them this is the cheat sheet that I reference, or this is how I search on Google, or this is how I search on Stack Overflow, right? So that's the most important skill that I can teach them. And then I also kind of want to leave them with artifacts that they can reference after the fact to, you know, kind of fill in those blanks, the different things they forget. So, you know, Jon sent me a list of things that he wanted to know how to do. I kind of have a generic list of things that I think people should know how to do
00:22:45
Speaker
at the Urban Institute when they start using R. And then I just try to get in the room, make everyone comfortable, and then sort of step through things as if I was doing it myself and explain sort of each step of the process to the person who's learning.
00:22:58
Speaker
That sounds awesome. The point about preparation is super interesting to me. Because we talk about how teachers should prepare, but maybe in some cases preparation means I have a plan I'm going to walk through and I'm going to lose some sort of empathy or reading or tailoring it to the person, right? Like, I knew on day two I was going to explain functions, so here we go, despite not taking any input on where we're at or where the student is at. That's super interesting to me.
00:23:25
Speaker
Yeah, I want to put myself in a position where I'm okay throwing away all of my plans. Right. You know, I don't want to leave the room behind just because I think this is where they should be right now and I'm not budging from that. So yeah, nothing could be worse. So, okay, I want to get, like, super, I want to be able to just literally imagine this in my mind. So what, is it a Tuesday? Is it a Wednesday? What day is it when you guys first get started? What did we do? We did Monday and Tuesday? Yeah, Monday and Tuesday. So I have a little meeting
00:23:54
Speaker
on Mondays from 9:45 to 10, so I got in early, everything was installed, and I ran into Aaron and said, I have to go upstairs for this meeting, so let's, you know, at 10:15 or whatever, let's sit down and start. So at 10:15 we opened the computer and he walked me through this booklet, which had, like, the dplyr and the ggplot and the lubridate cheat sheets from RStudio that, you

Understanding R's Philosophy

00:24:16
Speaker
know, a lot of R users probably have mostly memorized by this point.
00:24:21
Speaker
And then, yeah, we just sort of went through my list of things that I wanted to make sure that I covered. And you were in RStudio?
00:24:31
Speaker
Yeah. In RStudio. Yeah. And do you remember what was the first line of code you guys wrote and who typed it? Was it Jon or was it Aaron? Okay. So Aaron, why don't you go? Because my recollection is you brought your computer to set up the first R script. You were sort of showing me how you construct your own file system. Yeah, exactly. So I love RStudio. And I think one of the great things about RStudio is the ability to use R projects. So we sort of set up a directory, we set up an R project
00:24:59
Speaker
for keeping all the resources in one place and being able to use relative file paths and all the different sorts of advantages that R projects get you. I think in general, I try to touch students' computers as little as possible. If I actually want to show them something, I'll type it onto my own computer. So I think at that point, usually I start with dplyr and I teach people seven to eight functions. So filter for filtering rows, select for picking columns, mutate for creating new variables, on and on.
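The three verbs named here can be sketched in a few lines of R. This is a minimal sketch that uses the `diamonds` data set that ships with ggplot2; the particular filter condition and the `price_per_carat` variable are illustrative choices, not something taken from the session.

```r
library(dplyr)
library(ggplot2)  # loaded here only for the built-in diamonds data set

# filter(): keep the rows that meet a condition
ideal <- filter(diamonds, cut == "Ideal")

# select(): pick columns by name
small <- select(ideal, carat, cut, price)

# mutate(): create a new variable from existing ones
# (price_per_carat is an illustrative name)
small <- mutate(small, price_per_carat = price / carat)

head(small)
```

Each verb takes a data frame as its first argument and returns a new data frame, which is what makes the steps easy to chain together later.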
00:25:27
Speaker
And so we probably opened up an R script. And the first line of code, I would bet, is library(tidyverse), so that you can load all the packages that Hadley Wickham and company have written for doing all these great things in R. And we just started stepping our way through those seven or eight functions. So you were typing on your computer, and then Jon was typing on his computer too? Usually, no, I'll describe things to him. And then if there's something that's particularly problematic,
00:25:56
Speaker
I don't quite understand the pipe or, you know, what is this assignment operator? Then I'll show it to him, but I'd probably only type a tenth of what Jon actually ends up typing into his document. Yeah. So I'm opening up the, I have the code in front of me. So the tidyverse, um, now there's some spaces here, but the tidyverse install line was line 18. So actually the first thing we did was just assign numbers to an array, create some characters, sort of reference the array,
00:26:24
Speaker
calculate the mean and then assign the mean to a variable. And then we installed the tidyverse and then we loaded the diamonds dataset. And then the next set, like I can show you like, I mean, I can list through. So he also showed me like, you know, the right way to do comments, which, you know, again, the right way and the wrong way is, you know, partly, you know, preference, but.
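Those first few lines might have looked something like the following base R sketch. The values and variable names are made up for illustration; what the conversation calls an "array" is, in R terms, most likely a vector created with `c()`.

```r
# Comments start with a hash sign and run to the end of the line

# Assign numbers to a vector
x <- c(2, 4, 6, 8)

# Create a character vector
labels <- c("a", "b", "c", "d")

# Reference elements by position (R counts from 1, not 0)
x[2]         # the second element
labels[1:2]  # the first two elements

# Calculate the mean and assign it to a variable
x_mean <- mean(x)
x_mean
```

Installing the tidyverse is a one-time `install.packages("tidyverse")`, after which `library(tidyverse)` loads it in each new session and makes the `diamonds` data set available.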
00:26:45
Speaker
Like we did selections, we did renaming, we did filtering, we did arranging, and then we got into mutates, and then we got to the pipe. When you're doing assignment there, just to back up slightly, is it an equal sign assignment or using the assignment operator in R? Because I imagine that was a controversial conversation if it happened.
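To make the two ideas in this exchange concrete, here is a small R sketch of assignment and the pipe. It assumes the magrittr pipe `%>%` that dplyr re-exports; the specific columns and the `carat > 1` condition are illustrative only.

```r
library(dplyr)
library(ggplot2)  # loaded here only for the built-in diamonds data set

# Both forms assign at the top level; `<-` is the convention
# recommended by the tidyverse style guide
a <- 10
b = 10

# Without the pipe, nested calls have to be read from the inside out
nested <- arrange(
  filter(select(diamonds, carat, cut, price), carat > 1),
  desc(price)
)

# With the pipe, the same steps read top to bottom:
# select columns, then filter rows, then sort by price
piped <- diamonds %>%
  select(carat, cut, price) %>%
  filter(carat > 1) %>%
  arrange(desc(price))
```

The pipe passes the result on its left as the first argument of the function on its right, so `nested` and `piped` hold the same data.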
00:27:03
Speaker
So, okay. So to back up real quick. So like I said, I had tried different types of R courses and, you know, I had actually tried on my own several years ago to just learn R and base R. And so the style, I'm not exactly sure of the right word here, but, like, the overall look of the syntax, I kind of knew what the R code was going to look like. And I had, back in the day, which is not that long ago, but like,
00:27:28
Speaker
I have a few scripts in base R to make, you know, a map and a couple other diagrams, but like, I didn't really know, like, if you were to say to me, oh, go make a map, I'd be like, well, I have to, you know, find that old script and bring it up and modify it, which is not knowing how to code it. Right. That's like, just cobbling together.
00:27:47
Speaker
And so really, to get to the nut of the whole thing, I didn't want to have scripts that I downloaded. I could do this, right? I can go find how to make a choropleth map somewhere and I could grab someone's code and I could pop it in R and run it and, you know, hopefully it would work, and maybe make some edits. But then the next day, if you said, well, can you make me a map? I'd have to say, well, I have to go find the code I just, you know,
00:28:10
Speaker
find that old code, I don't really understand it. But I can, you know, and now the whole goal is to be able to say, can you make a map? And now I could say, yeah, sure. You know, now I know how to read in the data and call the right functions and

R vs Other Tools

00:28:21
Speaker
this and that. So to really, you know, understand how to do the coding. So I think for me, I'll throw it back to Aaron, actually. So Aaron, what do you think was my biggest, like mental gymnastics that I had to do? Interesting. I think people usually do really well with dplyr.
00:28:37
Speaker
You were really successful with ggplot because you've seen me kind of give the talk about the grammar of graphics a couple of times. I think the first thing that gets really complicated for people who have some data experience is getting into tidyr, right? So the whole philosophy of the tidyverse, which I subscribe to, is kind of get your data into this one format, right? And then all the tools work really easily with this format. And so if you're using tidyr, which basically takes untidy data and makes it tidy,
00:29:05
Speaker
by doing things like going wide to long and separating variables and things like that. That's where people start to kind of have hiccups mentally. And I think that was probably one of the tougher sections. I think that's exactly right. So understanding that this is like, when you create a subset of the diamonds data set, for example, you call it something and then you use the select function and you take diamonds and you take the things out of it. Whereas in Stata, for example,
00:29:35
Speaker
You would say, you know, drop price and carat and whatever, right? You would drop those or you would keep those. And the syntax is obviously different, but also just the philosophy of how you think about creating subsets is different. And also that you can hold.
00:29:54
Speaker
And this, I think, for me, at least, is one of the huge advantages of a tool like R, which is you can hold multiple data sets in memory at the same time. Whereas in a tool like Stata, you can only do that if they're merged together. You can't hold data set A and data set B simultaneously and refer to them differently. You have to use one, do something, save it, and then load another one. Yeah, and so I had seen enough ggplot where I knew
00:30:21
Speaker
OK, so there's this aes, there's this aesthetics function argument, and we're going to pop stuff in there. So I sort of knew the basic outline of how ggplot works, but not obviously the details. But we started with the actual data wrangling side and cleaning side of things, which is where things should start, and being able to build that and being able to take the data set and get to a place where you'd want to graph something.
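The contrast John draws with Stata can be sketched like this. In Stata, `drop price carat` or `keep price carat` changes the one data set in memory; in R, a subset is a new named object that sits alongside the original. The object names `diamonds_dropped` and `diamonds_kept` are assumptions made for this sketch.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Roughly Stata's `drop price carat`, but saved as a new object
diamonds_dropped <- diamonds %>%
  select(-price, -carat)

# Roughly Stata's `keep price carat`
diamonds_kept <- diamonds %>%
  select(price, carat)

# All three data frames now exist in memory at once and can be
# referred to by name; the original diamonds data is unchanged
ncol(diamonds)
ncol(diamonds_dropped)
ncol(diamonds_kept)
```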
00:30:51
Speaker
And so yeah, I think there are those kinds of things that were like mental gymnastics coming from a world of Stata and SAS where, like the pipe command is another one, right? Like that's a hard thing. Like you can't, that's like multiple lines of code. I mean, it is multiple lines of code, but they're together, right? Whereas in Stata or SAS, well, Stata, they're separate lines of code and in SAS they might be in a proc statement. So you have all these things in a proc statement. So those sorts of things are just different ways of thinking. And that's what,
00:31:21
Speaker
I really needed someone to sit down with me and help me work through

NHL Data Analysis

00:31:24
Speaker
that. Yeah. How long did you guys work with diamonds until you switched to the NHL data set? And is that National Hockey League data, or is that something else? Yeah. So I have a whole story about the NHL. So I don't have the timing here. So this first script we wrote was 190 lines. And at the very end, line 165, we get to reading and writing Excel data.
00:31:50
Speaker
And that's where we pulled in the hockey data that I had on my desktop. And I can talk a little bit about what I want to do with that.
00:31:57
Speaker
in a bit. But then from there, that's when we started getting into, okay, we've played with this diamonds data set, let's play with this new data set that Aaron also hadn't seen before, right? So I don't know if this is a good teaching strategy. Aaron, that's probably a better question for you. But, you know, I always worry when I'm teaching about trying to solve a math or statistics problem, like,
00:32:21
Speaker
off the cuff. That's always a dangerous game, right? And so we're working with this new data set. And so, you know, there were times when something would come up and Aaron would be like, oh, I haven't thought about this thing, or I haven't had this challenge. So let's solve it. And so like Aaron was saying earlier, it's great to see, you know, watching them try to solve the problem, right? Like, where do they go to solve the problem? Like what part of the cheat sheet, you know,
00:32:45
Speaker
where in Hadley's book, you know, where do you go to find the answers? And so once we got the NHL data, and then we started playing with that, and we messed around with the data, and then we did some transformations to some of the variables, and then ultimately we got into ggplot on day two. Yeah.
00:33:00
Speaker
I'd love to hear more about the NHL data because something I've been thinking a lot about, and the fivethirtyeight package is very much developed in this vein, is that the complexity of the data should match where the R user is at. So you can imagine if someone's very basic, let's make this a very curated data set with very few problems. And then if someone's ramping up, let's leave the NA values in, let's make it have a ton of rows, ton of columns.
00:33:26
Speaker
So I'm curious, yeah, tell us more about the NHL dataset, why you're interested, what you learned, how my Minnesota Wild look, and what shape that data was in. Yeah, yeah, yeah. Okay, so this is sort of a funny story, and a project that I've been meaning to do for a long time. I still haven't gotten around to it.
00:33:46
Speaker
So in 2017, my wife and I bought an Xbox for my kids, and my son really wanted FIFA soccer and so he got that and he also got one of the NHL games, I think NHL 17 or whatever it was. And as the Caps were making their run, see, I had to get the Caps in there,
00:34:08
Speaker
as the Caps were making their run in the spring to winning the Stanley Cup, I'm just going to keep going back to that, he and I were playing a lot of NHL on the Xbox. And what happened a lot was that he would score on me in the last minute of like every period. And I don't know, either I stopped paying attention or I got tired or whatever, but he would score a lot, and not just in the third period, in different periods.
00:34:35
Speaker
So I got curious as to whether that happens in the NHL for real. And so it turns out that that data is not readily available. So hockeyreference.com is my go-to for the most part to get the hockey data. And it turns out that they didn't have a data set sort of collected together, but they did have all the box scores listed separately. So I solicited a friend of mine to
00:35:01
Speaker
scrape all the box score data down. And so what I ended up getting or pulling together eventually was a data set back to I think the mid 90s maybe of every goal scored in every game. And so I have the time of every goal in every game. And so what I can look at is, you know, the timing of those goals.
00:35:22
Speaker
And I was sort of playing around with it in Excel most of the time, sort of like, oh, this is kind of cool. And I talked to a friend of mine who was great at Tableau and I was like, well, maybe you can make a fun dashboard and I could, you know, just make a couple of graphs and put up a blog post and just put the data out there. And I need to send it actually back to Hockey Reference and I haven't done that yet.
00:35:41
Speaker
A lot of things still to do. But basically, the interesting thing is that you actually do see these small spikes in goals in the last minute of the first and second period, which I think is kind of interesting because that's not what you would really expect. And then of course, you find a big spike in the last minute of the third period. You obviously find a huge spike if you include empty net goals. And people who understand hockey will understand how this works. We don't need to go into it.
00:36:08
Speaker
But you do see like a big spike at the end of third periods, which also makes sense because when teams pull their goalie to get the extra skater, that's not technically a power play goal. So you can't really control for that when they have an extra skater. But you can see these spikes.
00:36:26
Speaker
You know, I've got a lot of this data, right, and I can collapse it down by team, by year, look at playoffs versus the regular season, all these things that, you know, I haven't gone anywhere near doing, although I'd like to. But that was the data, and so we
00:36:43
Speaker
I'm just looking through my code here. So we have two different R scripts. We're reading the data. So back to what we did. So we loaded four packages in this first tab that I'm looking at, the first script: tidyverse, the readxl package, lubridate, and gghighlight. And so that's where we started. So I had one data set that had everything, regular season plus playoffs, and another data set with just the regular season. So one thing that we worked on doing was,
00:37:11
Speaker
reading in both data sets, merging them to be able to identify the playoffs because I didn't have a variable that marked when playoffs occurred. So we had to load both of those in, merge them to identify the ones that were the playoffs. And then we looked at things like the gap in the scores, the frequency that the home team wins. We had to calculate the NHL season.
00:37:37
Speaker
because of the way that the data were constructed. We did win percentages by season and I think that's where we stopped on the like the tabs and then we kept going. So we did a lot actually with the dataset. We ended up, I don't even remember what we ended up creating. Oh, I think that we ended up just creating like
00:37:56
Speaker
basic graphs, you know, line charts and bar charts and small multiples. I think that what I really like about ggplot especially is that once you understand the syntax,
00:38:12
Speaker
that there's a ggplot and there's this aesthetics function where you're going to put in a few arguments. And then there's a geom. And once you figure out what geom you need, it's not as easy as plug and play, but it's pretty close to that. And so we were just walking through these things and showing
00:38:32
Speaker
how to group things and create a mean within the argument and then, you know, create these plots. So yeah, it's a fun data set that I haven't done nearly as much as I want to, but maybe one day I'll get a couple of good graphs and just put the whole thing out there.
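The ggplot structure John describes (a `ggplot()` call, an `aes()` mapping, and a geom) can be sketched as below. The NHL file is not available here, so this sketch assumes the built-in `diamonds` data instead; the object names `avg_price` and `p` and the choice of a bar chart with small multiples are illustrative assumptions, not the charts from the session.

```r
library(dplyr)
library(ggplot2)

# Group the data and compute a mean for each group
avg_price <- diamonds %>%
  group_by(cut, color) %>%
  summarise(mean_price = mean(price), .groups = "drop")

# ggplot() takes the data and the aesthetic mapping,
# then a geom decides how the mapped values are drawn
p <- ggplot(avg_price, aes(x = cut, y = mean_price)) +
  geom_col() +
  facet_wrap(~ color)  # small multiples, one panel per color grade

# Typing p at the console (or calling print(p)) draws the chart
```

Changing the chart type is mostly a matter of choosing a different `geom_*()` layer, which is the "pretty close to plug and play" quality mentioned above.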

Teaching Strategies for R

00:38:48
Speaker
What's a row in that data set? Is it a goal? No, a row in the data set is a game.
00:38:53
Speaker
So what you have is a row is a game. So you have the date, the name of the home team, the name of the visitor team, and then, let me just open it, that's easier than trying to do this from memory. So, okay, I'm opening it. So I have 1995 to 2008. So each row is a game. So the first variable is the URL, then the date, then the home team,
00:39:21
Speaker
then the visiting team, then the score for each team. And then there's a set of, let's see, what is this? One, two, three, four, five, six, seven variables for each goal that's scored. So I've got the period of the first goal, the time of the first goal, the team of the first goal, the score of the first goal,
00:39:41
Speaker
the first two assists of that goal, and then whether the goal is an empty net or a power play, right? So if the first goal took place in the first period, you would see first period, 1:46, Detroit scored it, you know, Steve Yzerman scored it, assisted by Ray Sheppard, it was a power play goal. Or if the first goal was in the second period, you know, you have that all filled in. And so then this goes out to, I think, like 19 goals or something, 15 goals. Actually, I think it went out longer. I think we found some errors in the headers here, which was another challenge that we had to fix.
00:40:11
Speaker
And so yeah, each goal gets its own entry in these groups of variables. And so it's really wide then, and for a game that, say, is 1-0, you just have a bunch of NAs for the second goal, third goal, fourth goal, fifth goal, which we should get to, because the way you structure data is really interesting. It's a very untidy way, which I'm sure Aaron sort of realized as we went through. So Aaron, I'm curious, do you remember,
00:40:38
Speaker
Well, so a theme that's coming out here is the learning sprint with an expert using a project of interest to the learner that the expert hasn't necessarily seen. So I'm curious, and do you remember the first time you were sort of marching along this problem? And then you were like, Oh crap, I don't know how to do this. Or it's throwing an error that I don't actually know exactly what's happening. So now we're going to have to debug together.
00:41:02
Speaker
Yeah, I think the big one was the data were sort of in two sets, right, with the regular, well, I guess all the games and the playoff games, but in the all the games data set, there's no variable that says this is a regular season game. This is a playoff game.
00:41:17
Speaker
So we had to read in both data sets and then essentially find the complement of the two and label that as regular season, and then the rest is the playoffs. And I hadn't had to do joins like this before, so that was an interesting problem that took us a second to...
00:41:35
Speaker
to figure out together. And then the other thing that was interesting was just working through lubridate. So I work on Social Security mostly, which is a 75 year horizon. So we never think about things below a year. And I love lubridate, but I never get to use it because
00:41:53
Speaker
You know, the difference between February of 2042 and March just isn't that significant. And so we had to go in and we had the calendar dates of the games, but not the seasons. And so we basically had to come up with a little way using lubridate to say, you know, if the game starts after this date and ends before this date, then it belongs to this season.
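The two problems Aaron describes can be sketched as follows. This is a toy reconstruction under stated assumptions, not the code from the session: it assumes one table with every game and a second table with only regular-season games, a made-up `game_id` key, invented dates, and a simple rule that a season runs from September through the following summer. The real data and cutoff dates may differ.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-ins for the two Excel files
all_games <- tibble(
  game_id = c("g1", "g2", "g3", "g4"),
  date    = ymd(c("2007-10-05", "2008-01-12", "2008-04-20", "2008-10-10"))
)
regular_season <- tibble(game_id = c("g1", "g2", "g4"))

# The complement: games in all_games with no match in regular_season
playoffs <- anti_join(all_games, regular_season, by = "game_id")

games <- all_games %>%
  mutate(
    game_type = if_else(game_id %in% playoffs$game_id,
                        "playoffs", "regular season"),
    # Assumed rule: games from September onward belong to the season
    # that starts that year; earlier months belong to the prior season
    season_start = if_else(month(date) >= 9, year(date), year(date) - 1),
    season = paste0(season_start, "-", season_start + 1)
  )
```

The subject-matter half of the problem lives in the `month(date) >= 9` line: where the cutoff goes is a hockey question, while `month()` and `year()` are the R half.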
00:42:13
Speaker
And I remember both of those problems, you know, we had to really sit and think about it together. And it was fun because, you know, half of it is an R question and half of it's sort of a subject matter question, right? You know, when does the season begin? Oh, is that preseason? Is that regular season? And when does the season end? And I'm not a big hockey fan, but it was interesting to kind of have an applied problem to learn on that wasn't diamonds.
00:42:35
Speaker
So when you came across something like the dates or like the figuring out which type of game it was, how did you guys figure it out? Aaron, did you say, Oh, pause, John, let me figure this out by myself on my computer. And then I'll explain it to you. Or did you guys like get the whiteboard out and start talking it through or how? I think there's sort of two parts to it, right? The first
00:42:57
Speaker
part of the answer or the approach has nothing really to do with R, right? It's like, what is this subject matter answer, right? It's like, when does the hockey season begin? Or, you know, these data, what information do we have that we could use to answer this question? The second question is now based on that approach that we've come up with, like, how can we execute this in R? And so the first part, we're pretty much on
00:43:23
Speaker
even footing, or John knows a lot more about hockey than me. So that's very much a conversation. It's like, you know, what can we do? And that's something that people at the Urban Institute are already really good at, right? They know their data, they know their subject area, right? But then maybe they use Stata, maybe they use SAS. So that first part, people here are pretty good at. So then the second question is, now that I know that, how do I execute this in R? And if I don't know the answer to that, there's a couple of different ways we can approach it. So I think John referenced this. RStudio has all these great cheat sheets, right? They're kind of a page or two.
00:43:52
Speaker
for each R package. They look at kind of different sets of tools that are used to answer questions. So, you know, oh, let's go look at dplyr, right, so we can look at the different types of joins that we have, and maybe answer the question that way. If we can't figure it out from using the cheat sheets, maybe we'll shift over to Google. The one tip that I always give for people with Google is set the search window to the past year. Because if you start googling things,
00:44:18
Speaker
for R, maybe you're gonna get an answer from seven years ago that's just suboptimal. So maybe set the search window to the past year. And then- How do you do that? Just really quick? It's like under settings. I can pull up Google right here. No, that's a great, I don't do that. And I'm like reading like 1995 listservs. I actually forgot about that. And it's like, it's super, because it's one of the big frustrations for me now using R is like,
00:44:44
Speaker
You work on some solution, and then you realize that this isn't working. And then you try again, and you go a little bit further down the Google search, and you find, oh, there's a package to do this now, right? Then you just call it. So yeah. So you go into Google, you do a search, and then to the far right, there's Tools.

Balancing Struggle and Guidance

00:45:04
Speaker
And under there, you can select any time, past hour, past 24 hours, past week, et cetera, et cetera.
00:45:12
Speaker
And so usually I tell people do the past year and that narrows it down. You know, the tidyverse has exploded in R and so you're more likely to get a tidyverse solution to your problems instead of ending up with this zombie code of, you know, base R and tidyverse and all these different tools that maybe don't make as much sense together. That is unbelievably useful.
00:45:33
Speaker
You might have just saved me like 20 hours of coding in my next year. Like, wow. Now everything I just did, like how do I separate rows in R and before I was getting all this crazy stuff and now I'm just getting seemingly super useful answers from Google. Wow, that's amazing. It's the smallest things that make the biggest difference for people sometimes.
00:45:55
Speaker
Which is exactly why I think this methodology works. That is not something I think someone would put in a Coursera lesson, but it's something that's really valuable, and it's this tacit information that I feel like is being transferred when you're sitting down working next to someone that's almost as valuable as the actual high-level concepts. That's really, really interesting to me.
00:46:20
Speaker
But just to your earlier question, like what was the challenge? So one thing that we discovered that I didn't know. So I think the data set goes out to like 15 goals. And so we read the data in and we're noticing, we're doing a couple of things and we notice that R is reading in, there's a bunch of variables that aren't named that are going out. So we have to sort of figure out what is going on in this data set. And so, oh, well, there's actually like the maximum number of goals is like,
00:46:48
Speaker
You know, 20, but the guy who scraped the data for me, he just didn't name those variables at the top of the Excel file. And so we had to go back and sort of, you know, work out a solution for that. So that's an example of, you know, we had to explore and learn the data really together and then work on a solution together. So it was really still me typing,
00:47:09
Speaker
with Aaron saying, okay, let's try this. Oh, okay, let's try this. Oh, that worked. Okay, so now let's add this to that. So that was sort of the approach I think that we took. And there were a few of those things that would pop up here and there. That's how data analysis works, right? There are all these weird things going on and you have to sort of, you know, tackle them. Yeah, that's awesome. Did you always know what was happening? Was Aaron ever like telling you to write code that you didn't know what the purpose was for? I'm sure.
00:47:39
Speaker
The pipe is a good example. So write NHL, this pipe character command. I don't know what you should actually call it. Characters, I guess. And then some other line of code. OK, so let's just write that and now run it. And now you can see what happens. OK, so what happened? Well, blah, blah, blah happened. And it's because we used this piping, and that's what this thing does. So you have this moment of like, well, why am I doing this? But you quickly resolve that question.
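What the pipe does is easiest to see by writing the same steps both ways. In this minimal sketch the `nhl` table, its column names, and its values are made up for illustration; the point is only that `x %>% f(y)` is another way of writing `f(x, y)`.

```r
library(dplyr)

# A tiny, invented stand-in for the NHL game data
nhl <- tibble(
  home_team  = c("DET", "WSH", "MIN"),
  home_goals = c(3, 2, 5),
  away_goals = c(1, 2, 4)
)

# Without the pipe: calls are nested and read inside-out
nested <- arrange(filter(nhl, home_goals > away_goals), desc(home_goals))

# With the pipe: each line hands its result to the next step
piped <- nhl %>%
  filter(home_goals > away_goals) %>%
  arrange(desc(home_goals))

identical(nested, piped)
```

Running each version and comparing the results is the "just write it and run it" moment described above.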
00:48:07
Speaker
I think John's hitting on something interesting there because there are some things in R where it makes a lot of sense to like describe the theory and describe it with words and say, so this is going to work this way. And then there's some things you just have to like be like, see, see what it does. Like, and then people kind of can understand it from seeing what it does. The pipe definitely fits into that second group, right? You can describe it, but at some point you just need to show them the code. Whereas like ggplot, you can, you can kind of
00:48:33
Speaker
There's a whole book about the theory behind it. You can kind of build it up from first principles. And then the syntax reflects those first principles. Did you guys have, when you were describing that, John, it brought a really funny memory to me. Did you have any moments like this? Like I will be telling someone how to write R code and I'll be like, all right, write NHL the pipe. And then they'll type in literally NHL the pipe. And then like, I'll be like, no, it's like a percent sign. And then.
00:49:01
Speaker
And there's some funny YouTube video to be had there of people translating word code into actual code. And then eventually you're just like, give me the computer. Here it is. This is what it looks like. I think that was part of the reason why this works. So Aaron, I think, is great at really letting me, I'm sure he does with others too, but letting me struggle for just enough time.
00:49:25
Speaker
Like that's the whole thing about learning code, I think, for a lot of people: they struggle, and you get to a point where you're struggling where either you're like, screw this, I hate this, I'm never doing it again, or you say, I'm gonna power through it, I'm gonna figure it out. And I think when you are first starting out, that's a real crucial moment, right? And so I think the reason why this little venture worked so well is because Aaron is really good at sensing when I was at that inflection point
00:49:54
Speaker
and saying, okay, I could, you know, he could see that I wasn't quite getting it. Okay, so if you do this, you know, and then resolve it, right. And so I think that's a key to the whole thing. Whereas when you are doing these MOOCs, because you have the answer, if you get to that inflection point, and you say, I'm not going to do this, or I'm going to power through it, the

Continued R Practice

00:50:17
Speaker
inclination, I think, for me at least, maybe not for most people, but certainly for me, is like, I'm so frustrated. Let me just go look at the answer. Like, it's like the answers in the back of the book. So I'm just gonna look at it. And that is not the way in which you actually learn how to do this stuff on your own. Yeah, I think there's sort of two parts to that. So I studied saxophone for like 15 years, including college.
00:50:38
Speaker
And so a big part of me being there in person is to kind of be a commitment device. Like, you know, you show up to your piano lesson, you show up to your saxophone, the professor being there kind of motivates you to not cut corners. But I've also benefited from having a lot of really good teachers, and in particular teachers in one on one situations, and sort of learning that, that sort of pace or that
00:51:00
Speaker
that intuition for like now is the time to show them how to do this now is the time to let them struggle just because I've had so many opportunities to be around great teachers learning saxophone and so that definitely informed kind of my approach to teaching R.
00:51:13
Speaker
This is one of the most fascinating things that I think came out of that Twitter thread. It's this idea of productive versus unproductive struggle. I'm going to quote you on Twitter here, John. I don't know if this is the first time that this has ever happened on your own podcast, but this may sound like a super minor thing, but it's not. Struggling for a few minutes is one thing, but struggling for hours is totally another. Aaron's approach of letting me try for a bit and then jumping in when it was clear I was stuck was immensely
00:51:40
Speaker
helpful. And that's super interesting, right? We have this idea of, oh, to learn to code, I need to really stop and struggle and feel the pain. But it's like, if I've been trying to figure out how to set my working directory for an hour, like me sitting here and sweating more and trying more things and getting more errors is not helping me learn. It's just a really negative experience for me that's going to deter me from using R. It's way better if Aaron says, hey, use this command.
00:52:08
Speaker
That's right. And especially when you're first starting out and something you know is minor, like you know, reading in the data.
00:52:15
Speaker
is not the biggest hurdle you're going to have to get over. But if you can't get over that hurdle, then, you know, it's really frustrating to be like, okay, how am I going to create a choropleth map, you know, and move this and that around, if I can't figure out how to read the data? So yeah, I think a lot of people have that experience. Yeah, I think that definitely informs our kind of broader approach here at the Urban Institute. So
00:52:43
Speaker
I definitely subscribe to sort of the 80-20 rule. People here, when they're working on their own project work, if they have some more experience, they can do 80% of the work with 20% of the time and 20% of the frustration. And we have kind of on-call support where people can send me an email or give me a call and I kind of help them with that last 20%. And pretty much the rule that I tell people is the second it stops being fun, you're probably approaching something incorrectly, right?
00:53:10
Speaker
not understanding some concept of the tidyverse, or not knowing the right function or something. So the second it stops being fun, that's your warning light to reach out to me.
00:53:18
Speaker
And that seems to work really well for people, I think both kind of in this conversation with learning R, but also if they're using R for their work from day to day. You sound like you have such a cool job, Aaron, at the Urban Institute. That seems amazing. Like CPR, capital R or something. Yeah, I'm really fortunate to have been stubborn enough to carve out a little role like this at the Urban Institute.
00:53:41
Speaker
I would love to talk about, so something that came out in the Twitter thread too, while I have it up, is this idea of wide versus long data. And it sounded like, John, that was sort of unintuitive to you, to have your data long as opposed to wide. And in particular, I imagine that must have happened given the way this NHL data was structured, that it was super wide and just had an arbitrary number of columns, depending on what the maximum number of goals scored was in any
00:54:07
Speaker
NHL game in the last two decades or something. So how did the wide versus long conversation go? How did you approach it?
00:54:14
Speaker
Um, where are you at now when you're learning struggle, John? Well, what's that like? So this is an interesting thing, the wide versus long, because, you know, I come from mainly an Excel slash Stata slash SAS background. Right. And so that's like a spreadsheet way of thinking, but that's not the way R, and even a tool like Tableau, that's not the way they work. Right. I had
00:54:41
Speaker
an early experience with Tableau where I was trying to make a simple map. Just for myself, I had like, I had every state and I had like 15 variables that I wanted to be able to create a map and then just toggle through and just like be able to select the different variables. And the data were set up wide, which is, you know, for those who kind of don't know what we're talking about. It's like every row is a state.
00:55:01
Speaker
and every column was a different variable. And so for me, that was like the natural way to think of things. And I dropped it into Tableau thinking it would be, you know, slide two pills around and be all set. And it turns out that it was like a five step process, which still seems ridiculous to me, but okay, that's a different complaint. But if I had reshaped the data to long and put it in that way, then it was like a two step process, super easy, the way I'd expect.
00:55:26
Speaker
So the fact that R, and JavaScript as well, and Tableau, take more of an array, a vector approach, rather than a spreadsheet approach, I think that for me is a learning hurdle that takes a little bit of time to get over. So it's just a different way of thinking. I'm sure people have argued about which is right and which is wrong, but it's just a different way of thinking and just required a little bit of mental gymnastics on my part.
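The wide layout John describes (every row a state, every column a different variable) and its long counterpart can be sketched like this. The state abbreviations are real, but the variable names and numbers are placeholder values invented for the example, and `pivot_longer()` is the current tidyr function for this reshape.

```r
library(dplyr)
library(tidyr)

# Wide: one row per state, one column per variable (placeholder values)
states_wide <- tibble(
  state      = c("MD", "VA"),
  variable_a = c(10, 20),
  variable_b = c(30, 40)
)

# Long: one row per state-variable pair
states_long <- states_wide %>%
  pivot_longer(
    cols      = c(variable_a, variable_b),
    names_to  = "variable",
    values_to = "value"
  )

states_long
```

In the long version, "which variable to map" becomes a value in a single column, which is the shape that tools built around this approach tend to expect.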
00:55:55
Speaker
Yeah, I think you're kind of hitting it there. I think this is one of the great things about R for Data Science, which is Hadley Wickham and Garrett Grolemund's book, which is available for free online. If you just type in R, the number four, DS, it actually starts with visualization. And I think this is kind of one of the biggest advantages of starting with visualization: it's a good way to demonstrate this wide versus long thing. And so that's when it becomes really apparent that there's a difference in approach and that there's advantages to this approach.
00:56:25
Speaker
And the thing I just kind of always tell people when they reach this point is it's like, what is your unit of observation? Do you care about households? Do you care about people, maybe in the context of urban? Or do you care about games, like in the hockey data set? Or do you care about goals? Do you want a goal to be each row and to kind of put it into context instead of just
00:56:49
Speaker
Well, it's wide, it's long, it's wide, it's long. I find throwing those words around isn't always super helpful, especially since there can be many different versions of wide and many different versions of long in one data set. Yeah, I know Hadley named the spread and gather functions because wide and long are ambiguous. But I think spread and gather are also ambiguous, too. It's hard to come up with a term that's specific enough, or I guess it's just always context dependent is maybe the takeaway given your data.
00:57:16
Speaker
Yeah, well, and spread and gather are also verbs. But you know, I guess there have been situations where you may spread more than once, right? I mean, yeah, you can do it multiple times. So at least it leaves it somewhat ambiguous in a good way. So, to get to your guys' project, did you guys gather the data and get it into a row-is-a-goal format? Um, that's a good question. I think we I don't
00:57:43
Speaker
I don't remember. I mean, we must have to make some of the visualizations that we had. To make some of the visualizations, yeah, we must have. Well, yeah, we must have, because I can look for. Well, while he looks for that, I mean, I guess. We definitely worked on that. We definitely worked. Yeah, we definitely did that. Yeah. Because we had to flip it around to be able to do the ggplot. Yeah. But there were also questions that we could answer without necessarily doing that.
00:58:10
Speaker
which we certainly did before we gathered it. So if you care about questions about the team, then it's sensible to leave it in that format until you start caring about questions that deal with the goals. So you tackled almost some low hanging fruit with the maybe non-optimally structured data and then did the gathering. Yeah.
00:58:33
Speaker
That's interesting. I imagine I, if you guys remember, and it's okay if you don't, the, I imagine the time where Aaron was like, let's make this long and let's do that with this gather command. And here's how we're going to have to walk through like what we're defining each of these inputs to the gather command. I imagine that must've been a really interesting conversation. That's always one of the toughest things to teach, especially if you want to use the minus sign,
00:58:59
Speaker
you know, to withhold certain variables. I think when I teach that, usually I just say, let's spread it or let's gather all the variables. And then I say, you know, and then I kind of show them the data. It's like, ah, what's wrong with this, right? Now let's use this minus sign to kind of pull out the variables that we don't want to perform this operation to. You're exactly right. That's one of the toughest things to conceptually explain to people when they're learning R.
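The minus-sign step described here can be sketched in R with tidyr. This is a minimal sketch under stated assumptions, not the code from the session: the `goals_wide` data frame and its column names are made up for illustration. `gather()` is the tidyr function the speakers refer to; newer tidyr versions recommend `pivot_longer()` for the same job.

```r
library(tidyr)

# Hypothetical wide data: one row per team, one column per period.
goals_wide <- data.frame(
  team = c("Capitals", "Penguins"),
  p1 = c(1, 0),
  p2 = c(2, 1),
  p3 = c(0, 3),
  stringsAsFactors = FALSE
)

# Gathering every column also sweeps `team` into the key/value pair,
# so team names end up mixed in with the goal counts.
gathered_all <- gather(goals_wide, key = "period", value = "goals")

# The minus sign withholds `team`, so it stays as an identifier column
# and only p1, p2 and p3 are stacked.
goals_long <- gather(goals_wide, key = "period", value = "goals", -team)

print(gathered_all)
print(goals_long)
```

Comparing the two printed results shows what goes wrong without the minus sign: the first has no usable team column, while the second has one row per team and period.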
00:59:23
Speaker
When you explain that, do you explain explicitly the idea of a key value pair as Hadley and Garrett do in the book? Or do you sort of implicitly explain that just by showing them what's happening? Yeah, I definitely explicitly do it. And I'll come up with, you know, like on paper, I'll kind of write out a toy data set with, you know, three rows and three variables, you know, with A, B and C and one, two and three. And then I'll kind of show the transformed data set, you know, just all with pencil and paper.
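The pencil-and-paper toy data set can also be written out in R. A rough sketch, assuming tidyr is installed; the `row` identifier column and the cell values are invented for illustration and are not from the conversation.

```r
library(tidyr)

# Toy data: three rows and three variables (A, B, C), plus a row identifier.
toy <- data.frame(
  row = c(1, 2, 3),
  A = c(10, 20, 30),
  B = c(11, 21, 31),
  C = c(12, 22, 32)
)

# Each cell becomes a key-value pair: the key is the old column name,
# the value is the cell's contents. `-row` keeps `row` out of the pairs.
toy_long <- gather(toy, key = "variable", value = "value", -row)

# spread() reverses the step, turning the pairs back into columns.
toy_wide <- spread(toy_long, key = variable, value = value)

print(toy_long)
print(toy_wide)

# ?gather   # opens the help page for gather(), including examples
```

The long version has nine rows, one per original cell, which is the transformation the toy table is meant to make visible before running it on real data.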
00:59:53
Speaker
before we do it. And then from there I pretty much, this is always a great time to explain the question mark tool in R so that people can just remember, gather, put a question mark in front of it and then you'll get good examples of how it works because when I was learning R, I know I referenced that help document just over and over and over again and it took me a long time to actually
01:00:14
Speaker
memorize and internalize that. John, do you remember, I just got a few more questions. This is fascinating to me. I want everyone to be able to have the privilege to be able to do this with, I don't know, maybe Aaron, you're going to start getting a ton of emails.
01:00:29
Speaker
You're going to be the Sal Khan or something of R, where you're on YouTube all the time explaining R stuff to people. But John, do you remember, was there a time where you felt stressed or overwhelmed or anything like that? Were there spine-tingling moments where you had cold sweats? No, I don't think so. I mean, like I said, I think Aaron's really good at recognizing when I was hitting a roadblock.
01:00:58
Speaker
look, I sort of like took this on myself, right? Like I sort of like asking you for the pain. It's like, you know, I need some help. Can you help me do this? Um, so I, I kind of, and I knew what I was getting into and I knew why I wanted to do it. Right. So for me, I didn't really have that, like, like I said earlier, like there was no group around. So like, if I got stuck, I didn't feel bad because I had asked Aaron to come in and help me learn this. And, and I think,
01:01:22
Speaker
hopefully by setting my list of things I want to learn, I was sort of setting expectations, right? That like, look, I want to start with setting a working directory, right? I want to start with what's the difference between an R script and an R project, right? Like how do you read in the data? How do you write the data, right? So I was really trying to set the stage of like, I'm starting from the very bottom here and we'll work our way from there.
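The first items on that list (working directory, reading data, writing data) might look something like this in base R. This is only a sketch: the file name `games.csv` and the tiny `games` data frame are hypothetical, and a temporary folder stands in for a real project folder. In RStudio, opening an R Project sets the working directory to the project folder, which is one practical difference from a stand-alone script.

```r
# Point the working directory at a temporary folder so the sketch runs
# anywhere; setwd() returns the previous directory so it can be restored.
old_wd <- setwd(tempdir())
print(getwd())

# A made-up data set to write out and read back in.
games <- data.frame(
  team = c("A", "B"),
  goals = c(3, 1),
  stringsAsFactors = FALSE
)

# Write the data to a CSV file in the working directory...
write.csv(games, "games.csv", row.names = FALSE)

# ...and read it back in.
games_in <- read.csv("games.csv", stringsAsFactors = FALSE)
str(games_in)

# Restore the original working directory.
setwd(old_wd)
```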
01:01:46
Speaker
I mean, there were definitely points where I was like frustrated, but you know, frustrated in a good way where like it forces you to think about, okay, I'm going to solve this, try to solve this problem. But I don't think there were any points where I was like, you know, throwing in the towel on this. You know, I'd say that we didn't get to everything now having worked in R for another several weeks after that. You know, there are things that have come up that I would probably add back to the list. Like, I really wanted to start making maps. So that's like something that I probably would have added to the list.
01:02:15
Speaker
Aaron and I are going to talk in a little bit about some exporting things into different formats that we glossed over because we wanted to hit more important things. But there are some of those things where it's like, oh, in retrospect, we should have just figured out how to use the Cairo function to get the PDF writer to work. Little things. And at the time, I didn't know. I didn't know I really needed that to work.
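The transcript does not say which Cairo function was meant. One possibility, offered only as an assumption, is base R's `cairo_pdf()` graphics device from grDevices. The sketch below uses a made-up bar chart and output file name, and falls back to the standard `pdf()` device when R was built without Cairo support.

```r
# Hypothetical output location for the chart.
out_file <- file.path(tempdir(), "goals.pdf")

# Open a Cairo-based PDF device when available, otherwise the default one.
if (capabilities("cairo")) {
  cairo_pdf(out_file, width = 6, height = 4)
} else {
  pdf(out_file, width = 6, height = 4)
}

# Draw a simple chart into the open device (made-up numbers).
barplot(c(A = 3, B = 1), main = "Goals by team")

# Close the device so the file is actually written.
dev.off()

print(file.exists(out_file))
```

For ggplot2 charts, `ggsave()` takes a `device` argument that plays the same role as opening the device by hand.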
01:02:39
Speaker
But there are probably some other things. I should probably go back and make a list. But maps is certainly one of the things that is just different enough from a bar chart or a line chart that, I mean, it is a separate piece to learn. So I ask if you got frustrated or had cold sweats and you say, pretty much no.
01:02:56
Speaker
Like, that is an unbelievable testament to this process. Like, I dare say there is no one that has gotten from a messy data set to insights and graphing and summarizing without knowing much R that used like a book or an online course that didn't at some point think like, this is awful. I hate this. I'm dumb. This is stupid. Like, that never happens, right? So I think that's like the test. That's like the beauty that's here of this method almost.
01:03:26
Speaker
Yeah, yeah, I think that's right. And I think it's what you need to make it work is, I mean, obviously someone who knows what they're doing. So you need an Aaron, but also an Aaron who understands when to not let someone get too frustrated. And also you need, you know, on the other side, I think you need the person who is learning to be committed to it.
01:03:50
Speaker
Like I wasn't walking into this being like, it'd be nice if I could figure this out in two days. Like, eh, if I don't, no big deal. Like I was committed to being like, I have messed around with this for too many years. I need to finally just get this and also to have the right expectations, right? Like if you come into a class like that, and I, I'm sure all three of us have had students like this, right? Where people come in and like, by the end of this three and a half hour workshop, I want to be.
01:04:15
Speaker
an expert in whatever the thing is. I want to be the machine learning expert. That's not the way things are. At the end of the two days, it was like, okay, I am now a toddler. I can't read, but I can do the alphabet. That's the sort of thing where the student needs to have the right expectations about what they're going to get out of it. It's not
01:04:37
Speaker
100% of the answers, but enough that you could start knowing the letters enough to piece them together to be able to read later on. So it definitely takes both sides of it. And I've had a few people email me to say, I want to try this with a client or this or that. Do you think it will work with two people or work with five people? And obviously, I don't know the answer to that. I think it depends a lot on the personalities. Well, obviously, on the backgrounds of the people, whether they have any coding background.
01:05:07
Speaker
I'm not sure this works for someone who's never coded. Like, I'm not sure two days works for someone who's never coded. Maybe does. I don't know. Take a lot longer, right? Certainly someone who's never worked with data before would take a lot longer. But if you had five people in a room who had some coding background, you know, you had five me's in a room who
01:05:27
Speaker
had some coding background and wanted to, at the end of the two days, be like, I want to be able to handle myself in this regard. You know, maybe it would work. I don't know. It's hard for me to say. I think there's a lot that goes into it, but I'd be curious to hear if other people are trying these sort of sprints and what their experience has been because it's not scalable.
01:05:47
Speaker
Like, there's a reason the MOOCs are MOOCs, right? Like, you can't have Andrew Tran sit down with 15,000 people one at a time and teach them an eight-week course. Like, that's just not going to happen. But is there a middle ground somewhere? And I'm sure there is a question of where. Yeah, and I think it definitely matters on the dynamics between, if you had more than one person learning, the dynamics both between the teacher and the student and also the other students.
01:06:13
Speaker
because you want everyone to kind of be at ease. You don't want there to be any judgment. You want there to be curiosity and a willingness to try things out without knowing that they're going to work necessarily in the end. So there were times, you know, earlier when I was talking about the problems we had with creating the regular season and the playoff variable or different things like that where we're trying an approach that
01:06:34
Speaker
Not sure that it's gonna work and sometimes it doesn't work and that's okay. You can learn something from seeing why things don't work, but you have to be comfortable kind of pursuing those things, pulling on those threads. And I think the second you lose that, you're definitely not gonna be as successful in this type of environment.
01:06:50
Speaker
That's really, really interesting. One thing that I'm sort of realizing now, like an hour into this conversation, you guys sat together for a while, it seems like. Like usually like sitting side by side from, let's say, what, 10 to five on Monday and then another six, eight hours on Tuesday, like engaged in deep data work. Is that true? Yeah.
01:07:13
Speaker
Yeah, that's right. With a break for lunch. Don't forget the lunch. We had to go get some tacos. But you got lunch together? Or were you like, Aaron, John, I'm going to put my headphones on over in this booth. I'll see you in a half hour. No, no, we went out for lunch together. But I think, actually, when we did it, we were, maybe not consciously, but we definitely weren't doing a lot of R discussion.
01:07:37
Speaker
I think we were both kind of like, okay, we need, you know, we need an hour here or whatever it was of not thinking about, about R for, for just a little bit. So, um, but yeah, yeah, it was, it was two, two full long days. I wasn't going to let him get out of lunch. So I wanted my tacos. Yeah, that's right. That's right. I love the idea of like you work really hard and then you go to lunch and then you like take a bite of your taco and they're like, what was that command again? And then you should keep it going. You just, there's like,
01:08:07
Speaker
sour cream spilling all over your computer that's open while you're still working over. I mean, everybody needs the break. Everybody needs to understand that this is not easy stuff and both the student and the instructor need a little bit of break.
01:08:23
Speaker
So there's two other things I would love to talk about and we can add anything to the list that you guys have. One, I would love to hear John and Aaron, since you're in sort of a support role, what you've done since these two days, because it seems like spaced repetition is a real thing in education. And even if you knew something during those two days, if you go back to your more comfortable tools, which are probably the most efficient to get any immediate task done for you,
01:08:48
Speaker
you're probably going to just forget those things. And from following you on Twitter, it seems like you've kept up your learning really, really well, John. Yeah, thanks. Yeah, so I'm working on a big new project, and I've needed to create literally countless number of graphs. And so I did, especially the first few weeks after we met, I kept it up, and I was learning how to read in different data sets and create different type of graphs.
01:09:18
Speaker
Urban also has a theme that can be read in that Aaron helped develop. He can talk a little bit more about that. So I was learning how to read that in and how to apply it. And then some of the stuff is just minor, but, you know, stuff that makes graphs, you know, look good. So some of the smaller pieces. I mean, I think that's the big thing that I had to figure out, like, okay, here's the list of characteristics of a chart, you know, which ones am I turning on, which ones am I turning off? So there's all that, and there's a lot of Google searching, and I basically have like

Challenges with R Outputs

01:09:46
Speaker
I have in my home office, I have Hadley's book with the cheat sheets from Aaron sitting there. And I have two other books that I'm now blanking on the names, but I'll get them so you can put them on the show notes page. So I have those. And then every once in a while, I have some issue that I'd email Aaron and be like, hey, I know it's 1 in the morning, and you're not actually having fun, but I'm coding. Can you help me with this or that?
01:10:15
Speaker
then I started getting into maps and started trying to play around with different types of maps. And that has been, I'd say a little bit more frustrating because the mapping stuff seems to be kind of all over the map. It's just, you know, there's just a lot of different approaches and techniques and JSON format, file formats and all this sort of stuff. So, you know, I'd say that has been more frustrating. Like I think if we had added
01:10:38
Speaker
you know, another four hours or whatever to focus on, not four hours, but, you know, another couple of hours to focus on maps, that's something that probably, you know, in retrospect I would have added to my list. But yeah, I've tried to keep it up, and it's, you know, sort of ebbing and flowing in terms of what I have time for. I'm working with the designer on this particular project, and the designer wants everything in Excel, and I was like
01:10:59
Speaker
Like, now I can create all this stuff in R, and I got to go back to Excel. So that experience has been interesting on its own. But also learning the different output formats in R, it's been interesting. So yeah, I've tried to keep it up, but it's challenging, and it's frustrating when you're like, oh, man, I can make this bar chart now so easily, and I want to make this cartogram. But it's not the same. It's just a little bit different. But it's all good. It's all fun. It's all fun and games until someone doesn't get the genome, right?
01:11:29
Speaker
Yeah, that's a great line. I've been amazed kind of with John's persistence and his self sufficiency. So, you know, we finished up on a Tuesday and I went on vacation and a couple days later, I was getting emails at one in the morning, you know, with maybe 20 lines of code that he had written and said, you know, I can't figure out how to do this one thing. And I'm looking at those 20 lines going, wow, you know, you're really
01:12:09
Speaker
the word, the name, but I didn't know like how you do it or what it really is. And that was something that he added at the end. And I think it's something that, you know, I would definitely add to any sort of intro lesson in R. Yeah, and it's really good if you're learning, or you can use R Markdown.
01:12:13
Speaker
able to do it
01:12:25
Speaker
the student can, right, to kind of gather all the thoughts and walk away with an artifact that you can reference after the fact to kind of fill in the gaps from maybe what you did to master what you forgot. But it was great seeing John do all this and getting emails about R at one in the morning. I'm a little jealous. It's like, why am I spending time with family? I'm kidding. Um, I think this isn't the answer you're looking for. But I'll answer the question a little differently.

Mastering R through Teaching

01:12:51
Speaker
And I think
01:12:52
Speaker
You're right, spaced repetition is really important. And I think that's one of the things that I love about teaching is that for me, as a student of R trying to become a better R programmer, it's really nice to kind of always have people contacting me about the fundamentals because then I get to just keep going over the fundamentals, keep explaining them in different ways and really master them.
01:13:11
Speaker
And that's great for the things that I already know. And then I get lots of questions about things that I don't know. And then that's an opportunity for me to learn something I don't know to a level where I can explain it to people. And so for me, that's been incredible for learning R. And I think if you're someone who knows some R, who wants to take it to the next level, finding people to answer their questions is a really valuable way to learn. Yeah, this makes me want to start something
01:13:40
Speaker
Like we have R user groups and we have local communities, but like an Uber for R or something, you can go on there and find someone within a five mile radius that will commit a half hour to come over to your place and help you debug your code. And you just have to offer them coffee or something like that. That could be a really good useful tool for some people. Yeah. I mean, I think, I think it is a useful tool. I don't know if there's a, there's an app for that, but, um,
01:14:08
Speaker
I think certain people's learning styles work. I'm sure a lot of people learned R from Andrew Tran's course or from Roger Peng's course or whoever, right? I'm sure lots of people learned how to do R and making great things. That was their start. And for me, just in the way I learn and my schedule, it just didn't work for me. And so hopefully the message that comes out of this discussion is that if there are people out there who want to learn R,

Personalized Learning Styles

01:14:38
Speaker
And, you know, for whatever reason, those sorts of classes don't work for them. And it might just be that you find someone, find an Aaron or find Aaron, and bring that person over. And you might not need two full days. Like, I wanted two full days because I wanted to get through the data
01:15:01
Speaker
reading, exploration side and into the ggplot, right? But for some people, ggplot, the data viz part, may not be near the top of the list. And so maybe a day or two, you know, whatever it is, will work for them. Once they can do the data part, the analysis and the cleaning and the munging, then ggplot, the visualization part, comes later for them. And I think that's, you know, obviously that's fine. So
01:15:27
Speaker
I just think that, as you've mentioned a few times, there's just different learning styles and there's not a one-size-fits-all and it's for professionals who want to learn these different skills. It just may require a different approach.
01:15:40
Speaker
Yeah, that's, that's awesome. Can we, the last thing I would love to close with is I think we've hit on a variety of points. Is there any, can we sort of generalize this in some way? Like what to do and what not to do, like lessons learned. It seems like, John, that you came in with a project was really helpful. What are the other sort of four or five principles to remember if you want to try this at home? Yeah, that's a good question.
01:16:05
Speaker
Or Aaron too, from the expert standpoint, if someone comes to you and says, I want to learn R, what should you do or what should you not do? Yeah, so I think kind of have two answers to that question. One is if you're going to do it in this format, there's kind of one approach. And if you're going to do it on your own, there's a different approach. I think in this format, from the instructor's point of view, you want to have enough material that there's never a lull, but you don't want to be so wedded to that material that you're rigid.
01:16:30
Speaker
You want to put yourself at ease and the other person at ease. You want to be comfortable, be willing to explore. The biggest thing is you want to teach that person how to answer questions for themselves. So teach them enough R that they're able to ask the right questions because ultimately, if you're a good R programmer, you're going to spend most of your time
01:16:50
Speaker
answering the questions that haven't been answered. You're working to get out onto that frontier where you don't have all the answers. And I think that's the most fun part. So teaching people how to tackle that is really important. And then pointing people in the direction of really good resources.
01:17:05
Speaker
So I mentioned the R for Data Science book. I think that's the best place to start. It's such a clear example of how to use these tools. And if you master that book, you'll be able to do so many things that are rewarding and valuable for employers. I think the cheat sheets on RStudio, all the resources on RStudio, they have great videos, are really useful.
01:17:27
Speaker
I think the R community is really great. And in particular on Twitter, I think it's one of the bright spots

Community Support and Resources

01:17:33
Speaker
of Twitter. If you look at the #rstats hashtag, you can really get access to a lot of interesting people who are very generous and willing to give you their time. We have internal resources. I think John tweeted out a link to kind of our internal R website. And we've developed a couple of different tools for data visualization. So the first ones are urbnthemes and urbnmapr.
01:17:54
Speaker
I like to point people in the direction of resources they can use to answer questions that they haven't even thought of yet.
01:18:00
Speaker
Yeah, I think from the student side, defining what you want to learn out of the sprint is really important. What are the most important things for you to learn and to recognize that maybe the person that you're working with may not know all the answers right off the top of their head, especially if you're working with the data that they've never seen before.

Goal Setting and Projects

01:18:21
Speaker
Understanding that when you come out of a sprint like this, that you're not going to be an R expert, but you're going to need to continue
01:18:29
Speaker
to work with the tool and to continue to develop things. I think having a project that you can work on is really valuable. It doesn't have to be, I mean, you know, the NHL data we're working with is just for fun. It's innocuous stuff. It wasn't, and it was just something I was interested in. I think that can be really helpful for people.
01:18:45
Speaker
I guess I would just reiterate the biggest thing of being open to struggling with the tool a little bit and understanding that struggling is good, trying to figure out the answer is good, but that there's a point where you need to turn to the person you're working with to say, look, I'm not quite getting this. Can you help me here?
01:19:06
Speaker
I think that's the way to go about doing it, really. Yeah, I'll add one thing to that, and we weren't deliberate about this ahead of time, but I think it's something you should be as a student, is finding some kind of commitment device, finding some way to force yourself to revisit the material, like you talked about with spaced repetition. So for John, maybe that's every time he sees me in the hallway, if he hasn't done enough R programming recently, he'll feel shame. Maybe there's a healthier way to do this.
01:19:35
Speaker
For junior people here at the Urban Institute, something that seems to be successful is
01:19:39
Speaker
promising their boss that they'll do something that they don't quite know how to do yet. And having that deadline and being on the hook for that, you know, it's good motivation to keep revisiting R. But maybe before you even enter it, when you set those clear expectations for what you want to learn, I think you should also, yeah, set like a clear expectation for what you want to do in the month or two after the sprint, and have some kind of mechanism for
01:20:04
Speaker
forcing yourself to keep doing it.

Choosing a Learning Partner

01:20:07
Speaker
Because that's really how you master these tools. That's unbelievably cool. Like this is so cool. I think this is the cool thing. I love this so much. The other thing you guys have sort of said, but not explicitly named, and something that stood out to me just listening to you guys talk, is the relationship you guys have must be so important.
01:20:28
Speaker
And it seems like you guys have a really good relationship and a really good dynamic. And I just feel like that must matter so much that you would not want to do this with a colleague that you didn't like, for example. Yeah, absolutely. Absolutely. Trust. I think it might also be the case that I don't know if I would hire someone that I didn't know.
01:20:47
Speaker
Yeah, I don't know, maybe I would. I guess it just depends. Obviously, the fact that I know Aaron, I know his dedication to R, and that he's, you know, I've seen him speak and I've seen him teach and I know that he's good at all of it, so I felt comfortable saying, you know, can you help me with this? Whereas if I called up
01:21:04
Speaker
learn r.com or, you know, whatever, right? Like, um, and they were going to send a person out. Like I wouldn't know that person beforehand. So the Uber version of this may not work. You might need the Uber version where you can interview the person before you actually have the lesson. Maybe that's what you need. You know, there's a clearinghouse part to this. I'm not really sure. So, but
01:21:23
Speaker
But I agree. You're spending 16, 20 hours together sitting down next to each other, working on something. And if you don't like the person in either direction, it's going to be a long haul. Yeah, I think that's fair. Wonderful. That's all I've got. Do you guys have anything else you want to talk about or? I'll plug one thing.
01:21:46
Speaker
So if you're interested in the work that we do sort of building an R community here at the Urban Institute, we'll have a blog post coming out about that in a week or two.

Upcoming Blog Post on Teaching R

01:21:55
Speaker
We have a Medium site called Data@Urban where we sort of highlight all of the more data science type work that happens at the Urban Institute. And that'll just sort of walk through our approach to teaching everyone here at the Urban Institute R and helping people use R to improve their work. Really cool. Anything you want to plug or say, John?
01:22:12
Speaker
Uh, no, I don't think so. I mean, I think this has been a fun, uh, conversation and we should just, um, I think people should, you know, whether it's R or Tableau or JavaScript or Excel or, or whatever it is, I would just encourage people not to give up. Um, I've been guilty of that. And I think this approach that we tried worked for us. It may not work for everybody, but, um,
01:22:30
Speaker
Part of the enjoyment is actually the struggle, I think, of trying to figure it out. But you can go too far with struggling and getting frustrated. So whatever the tool is, I think just find the way that you learn best and try to find someone who can help you do that. Yeah, really good.
01:22:46
Speaker
And the next podcast we record together should be a fully hockey podcast. And we can figure out, one, why do teams score so much at the end of periods? And how did this video game company reflect it in their product? And how is your son so good at it? Yeah, that's a big question. He's not going to want me to answer that question.
01:23:06
Speaker
Um, yeah, yeah, that's great. That's great. Well, thanks so much for having us. Yeah. Wonderful. Really fun to, really fun to chat. And yeah, just, I think this is really cool. I hope other people experiment with new ways of learning R, like finding an expert and sitting down with them for, for two days. So this, this has been really fun. Thank you guys.
01:23:23
Speaker
Thanks everyone for tuning into this week's episode. I'm really grateful that you tune in every week to listen to the show. I'm also very thankful for the guests who come on and agree to chat with me about data and data visualization and presentation skills. If you'd like to help support the show, and I hope you will,
01:23:40
Speaker
Consider leaving a review of the show on iTunes or Stitcher or your favorite podcast provider. Or maybe consider financially supporting the show by going to the Patreon page and donating to the show as little as three bucks a month to help me pay for transcription services and audio editing and all the things that are needed to put the show together.
01:24:00
Speaker
or you can always just keep listening to the show every other week when a new episode is posted. So thanks again for tuning into this week's episode. This has been the Policy Viz Podcast. Thanks so much for listening.