Introduction and Sponsor Mention
00:00:00
Speaker
This episode of the PolicyViz podcast is brought to you by Juice Analytics. Juice is the company behind Juicebox, a new kind of platform for presenting data. It's a platform designed to deliver easy-to-read, interactive data applications and dashboards. Juicebox turns your valuable analyses into a story for everyday decision makers. For more information on Juicebox or to schedule a demo, visit juiceanalytics.com.
Guest Introduction: Lynn Cherny
00:00:35
Speaker
Welcome back to the PolicyViz Podcast. Happy New Year, everybody. I hope the kickoff to 2016 is treating you well. I'm excited to continue the PolicyViz Podcast this year with a very special guest, Lynn Cherny, data consultant, analyst, data visualization specialist, and currently the visiting Knight Fellow at the University of Miami. Lynn, welcome to the show.
00:00:57
Speaker
Hi, glad to be here. Thanks for coming on. I'm guessing it's warmer down in Miami than it is up here in DC.
Teaching Data Visualization at University of Miami
00:01:05
Speaker
But I want to talk about sort of two main threads today. You've been down at Miami now, starting your second semester there, teaching data visualization to, I would guess, primarily journalists.
00:01:17
Speaker
Yeah, well, the first class was journalists. It looks like the second class this semester is going to be more mixed. It's MFA students and journalists. So I want to start by talking about what you're teaching and your sort of approach to teaching data visualization, which is always a big question in the field about how do you
00:01:34
Speaker
get people to be good at visualizing data so quickly. And then I want to turn a little bit to talking about some of your previous work on visualizing texts because I think that's a big challenge for lots of different people. But let's start about teaching data visualization. What's your approach? What are you trying to get people to understand in 14 or 15 weeks?
00:01:56
Speaker
Essentially, I was brought down to teach interactive dataviz, so dataviz for the web. We already have Alberto Cairo here, who is well known for his static infographics. He recruited me to come down and add an interactive component to the coursework.
00:02:13
Speaker
So I've been teaching primarily D3 because that's the most flexible tool. It's been kind of an uphill battle because not all the students are adept at JavaScript, but I still think it was fun for all and everybody tells me they learned a lot and their projects certainly show they learned a lot.
00:02:33
Speaker
And so when they come in to your class, what do they have sort of in their toolkit already? Have they taken statistics? Do they know how to work with data generally? And now it's about visualization or you have to build up sort of all of those skills.
00:02:47
Speaker
Yeah, good question. That was sort of my biggest question when I started putting the course together. The journalism students have had Alberto's class, and so they have some sense of data analysis and data storytelling. The students I had from other departments, I had a couple, they don't necessarily have strong stats or data analysis skills.
00:03:10
Speaker
Yeah, there were a few folks that I had to have the like why pie charts aren't the best display for lots of categories conversation, but a lot of them already had a sense of what was a good visualization method. Yeah, but they don't necessarily have any stats. So that's, that's a different question.
00:03:29
Speaker
So what's your philosophy to the approach? Do you give them a bunch of data and say, we're going to go make some visualization? Or do you say, go find some data? Or I'm going to give you a messy data set. You need to clean it up. What's the general philosophy or approach over the course of the semester?
Working with UNICEF Data: Challenges
00:03:44
Speaker
Well, I kind of inherited it. Alberto had set up a client for us, which is UNICEF. And so in a way, that was good for me because it constrained the data arena and the problem space a lot. But it was also bad in the sense that I had to come in and pick up this client relationship and figure out how to deliver for them.
00:04:06
Speaker
They gave us quite a lot of freedom, obviously, because we're doing it for free. But we did have to deal with that data and trying to find complete data from various countries in World Bank, World Health Organization, and the UNICEF data sites.
00:04:21
Speaker
So that was a challenge. There wasn't always good data longitudinally for some countries. So students were occasionally frustrated by not having exactly what they needed for the story they wanted to tell. But at least mostly it was online in Excel files. And I gave them a bunch of quick tips on how to use Excel to get it into the right CSV format for visualizing.
00:04:47
Speaker
So they're trying to merge these different data sets together and then they're bringing them into JavaScript. So what was that like trying to teach journalism students who may not have sort of any, I would presume, much or any coding background?
Teaching JavaScript and D3 to Beginners
00:05:00
Speaker
Yeah, so they came in with HTML and CSS because they had to have some baseline in web design. It's just too much to teach otherwise. I had given them tips on how to learn some JavaScript over the summer. So before the semester started, they were theoretically supposed to have gone through some online courses.
00:05:19
Speaker
So they had some knowledge of JavaScript, and some of them were taking a concurrent class in intro to JavaScript and jQuery. One of them was in a Processing class, you know, a p5.js class. But it was definitely a challenge to do the data manipulation work in JavaScript. I kept having to assign little exercises like: here's how you would debug some JavaScript, here's how you would merge data sets, here's how we clean it.
00:05:48
Speaker
Here's how we check for empty values, things like that. There was a parallel track of trying to bootstrap everyone's knowledge of JavaScript along with the learning D3 and chart making and interactivity. Are you mostly creating things in the class and walking them through it or are you saying, we're going to do this thing today, I'm going to show you a quick demo and then you're just going to go in and do it and I'll walk around the room. What's the approach in the classroom itself?
00:06:17
Speaker
Yeah, that's a good question too. Well, I structured it so that I covered things in D3 that I think are important for journalistic visualization. So that means I focused on starting with bar charts, line charts, time series, area charts a little bit, but things that are much more the journalistic visualization type, so not any of the wackier D3 layouts and
00:06:40
Speaker
chart types, like we didn't do networks or anything. I showed them at the end links to other layouts, but I stayed with sort of basic statistical charts. I did make a dot plot example for them, etc.
00:06:53
Speaker
You know, each week we would do another type of visualization using real UNICEF data, based on an example or two that I had thrown together over the weekend. And the focus was always on interaction. It was not just, here's how you make a bar chart in D3, because you can use other libraries for that much, much faster. It was, if we have a button here and we want to animate this bar chart,
00:07:17
Speaker
Here's how we do that. Or if we want to bring new data in and join it to the existing data and animate that, here's how we would do that. Here's how we do a scatter plot where you're zooming in on the top 20 in the upper right. So it's very much about what are the interactive techniques that will get you somewhere interesting using D3, not things that you could do easily in other simpler libraries.
00:07:42
Speaker
But yeah, they did use my examples. I mean, the whole first three quarters of the class was take my example and do your own with your own data, hopefully a focus that they
Class Structure and Student Projects
00:07:53
Speaker
could then use in the final project. So I kept saying, you know, make sure you hone in on a topic in the UNICEF data that's interesting to you, like maybe a country focus or something or a specific illness problem.
00:08:03
Speaker
and make your weekly examples using that kind of data so that then it'll be easier to put your project together at the end because it'll be combining things you've already done. That didn't always work so well because they weren't sure for a long time what their project theme was going to be. They definitely didn't come up with a data story in the UNICEF data for quite a long time, which was a big problem actually.
00:08:28
Speaker
So it's sort of interesting, some of the things you described sound like your sort of classic UI-UX challenges, but on the other hand, they're pretty specific. So were there specific texts or articles or resources that you had them try to use? As far as I know, it doesn't seem like there's a text that speaks specifically to what you were trying to accomplish with these students.
00:08:50
Speaker
No, and it's particularly a problem because I was after, like I said, focusing on interaction and a lot of the D3 books and online tutorials are about very simple, just building this kind of bar chart.
00:09:06
Speaker
There are a few tutorials, so Jim Vallandingham and Nathan Yau have tutorials that are great, that are a little more focused on news, journalistic-style interaction. And so I adapted several of theirs. And when I say adapted, I mean
00:09:23
Speaker
in some cases, in Jim's case, some of his code is pretty advanced. These are new programmers. So I sometimes simplified some of his code, which wasn't always a good idea, it would turn out later. Sometimes you just have to explain how things work so that they can see the benefit later on, when they're ready for it.
00:09:41
Speaker
And Nathan's stuff, I had to update it a little bit. Some of his examples are a little older on his website, like old versions of D3 or whatever. So I cleaned up and updated and then had to use real UNICEF data in all of them. So there were tweaks for that as well.
00:09:59
Speaker
And what about the pieces of coding that are necessary, but may not be the most fun things? You know, it's not the actual creation, but it's like debugging and setting up a server or getting things onto GitHub. Did you sort of walk through that? And what was that experience like for this group of students?
00:10:18
Speaker
Yeah, so the first class was the most horrifically boring ever, and I'm already afraid. Well, this next incoming class, they seem to have more computer experience. But for these guys, I really, the first half of the class, after I said this is what we're doing and what the UNICEF report currently looks like,
00:10:38
Speaker
now we're going to set up GitHub. Now we're going to set up a server on your machines. And literally, I'm not leaving this room until you're all running a server locally. Because that had been a problem that Scott Murray and I had in the online D3 course that he was teaching and I was TAing. Lots of people struggled with just getting their server installed and running. So yeah, there was definitely a lot of setup initially. But the students coming in this next semester have already been working with GitHub more.
00:11:06
Speaker
And the reason to use GitHub is because, well, not only is it a job skill, but also it's easier for them to get my updated course materials because they just pull the latest version every week, and they can post their own stuff publicly, which then becomes portfolio material. And my real goal in this class is to give them portfolio materials. It's not so much that they get an A and get into grad school. It's that they have project work and portfolio work that looks really good to the outside world. That's really the goal.
00:11:36
Speaker
All right, interesting. So now that you're starting the next semester, are there any sort of like wholesale changes you've made to the class or is it more sort of tweaking around the edges?
00:11:46
Speaker
So there's a few changes I have to make, and I'm still doing them during the break. One is I underestimated how hard it is for people who aren't great at JavaScript to learn how to structure their code. And this is also a problem with almost all of the online tutorials for D3. They're all very simple, and that includes Mike Bostock's blocks. They're all simple, not a lot of functions, not a lot of structure to them.
00:12:15
Speaker
If you're doing a big project, you need to organize your code. So it was pretty obvious in preparation for the project, three quarters in, when I said, okay, now I want you to just make a page that has three of the charts you're going to use in your project on the same page. And it almost killed them because they had all these variable conflicts and they couldn't figure out how to find the problem, et cetera, because it was all just one big global, no-functions-anywhere
00:12:40
Speaker
mess. So that's my fault. I definitely should have shown them earlier how to structure things so that it was easier to debug. So that's one thing. I'm going to make sure they know something about using the command line earlier. I didn't teach that, and it kept being a problem. Just because you need to do lots of little things at the command line, especially with Git.
00:13:03
Speaker
I'm going to do more non-journalistic examples and have a freer choice of data topic because not all of the students coming in want to work in journalism. And it's true, not every data visualization person becomes an interactive journalist. They go and they get jobs like you or like me consulting places.
00:13:25
Speaker
So learning a little bit about how to make dashboards, what that means, tools out there for structuring your code, although I'm not going to teach any of the JavaScript frameworks because we just don't have time in this class. Give them a little bigger picture of the job space and the problems that they might work on with these tools.
00:13:46
Speaker
Yeah, I mean, at least being able to, should they go to a place that they're working on static graphics or they're working on whatever, but at least being able to understand what goes into some of these things. I find a lot of people who don't have experience with, you know, working in JavaScript or making online tools, they just say, well, just go make it. Like, how hard can it be? Just press a couple of buttons and, you know, we all know that it's a lot harder than that. So respecting that and understanding what it takes to create an online visualization is itself, I think, very valuable.
00:14:14
Speaker
Yeah, and I also think that the process of using tools, you know, it makes you have to figure out a good reproducible workflow. Like, you know, when I get an update on my data or I add data, I have to figure out how to quickly incorporate that and update the graphic or whatever. There's some good skills just in terms of work process if you're doing this stuff well. Absolutely.
00:14:37
Speaker
Good. Well, good luck this semester, but before we finish, I don't want to miss the opportunity to talk with you a little bit about text visualization, because I think it's probably one of the, I don't know, maybe it's one of the more unexplored areas.
Text Visualization Techniques and Challenges
00:14:49
Speaker
It certainly doesn't get a lot of play, I don't think, in the field. So I'd like to start by asking you to talk a little bit about some of the projects you've worked on, maybe in the past. I think one of the first projects that I saw you at least present was at OpenVisConf three years ago, on some work you were doing.
00:15:05
Speaker
So I'm curious to hear what sort of text visualization you've been doing. And then let me just preface all of that by saying I think a lot of folks that I talk to who are working with qualitative data have a lot of challenges sort of analyzing those data and then visualizing those data. So those are sort of the two pieces I think people are interested in hearing about sort of what you've been working on or have worked on and what you might recommend to do a good job with working with sort of this loose
00:15:35
Speaker
qualitative or text data. Okay, yeah, those are kind of big subjects. Yeah, so I'll just give you two huge questions and just go ahead.
00:15:44
Speaker
Yeah, well, the thing I'm most famous for, probably, is using machine learning to detect sex scenes in Fifty Shades of Grey books. Yeah, I published the code for that earlier this year, actually, because I did a tutorial in an online, like, tutorial set of Python notebooks, and the code for that is in there. And those are pretty popular notebooks. Well, I'll say exactly why, but OK.
00:16:12
Speaker
A classic example for that technique, the Naive Bayes method I used, is spam detection, which is just boring. So yeah, I wanted to do something more interesting. Text visualization is a problem for a lot of people because of tools, for one thing. So you can't just jump right into JavaScript yet. There aren't enough good tools in JavaScript for doing text analysis and visualization, which means you have to start in some other tool, like Python or R, which are the most common ones.
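For readers who have not met the Naive Bayes method mentioned here, the idea can be sketched in a few lines of Python using the classic spam example. This is a minimal sketch with add-one smoothing on made-up toy messages, not the code from the notebooks discussed in the episode; the function names `train_naive_bayes` and `classify` and the sample data are assumptions introduced for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Count documents per label and words per label from (tokens, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def classify(tokens, model):
    """Return the label with the highest log-probability for the given tokens."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Start from the log prior: how common is this label overall?
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen word/label pairs.
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy, made-up training data: (tokens, label)
TOY_DOCS = [
    ("win free prize now".split(), "spam"),
    ("free money offer".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("lunch on monday".split(), "ham"),
]
model = train_naive_bayes(TOY_DOCS)
print(classify("free prize".split(), model))
```

A real project would need far more labeled examples than this, which is the training-data problem that comes up later in the conversation.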
00:16:41
Speaker
Well, there are also lots of people who've been doing Java for years and other things. But I use Python and R. And you'd start with those tools and do some basic analysis. So you can look at parts of speech, like nouns in somebody's chapter. And it gives you a sense of what it's about, a simple sense. To do that, you need libraries that can do noun detection or parsing and part of speech detection.
00:17:09
Speaker
So that means installing a whole bunch of stuff on your machine that you don't necessarily have installed and figuring out how to use it. Luckily, there are lots and lots of tutorials for that. So that's not that difficult anymore.
00:17:21
Speaker
Other things are just tokenizing, turning things into words. And what is a word is kind of a debated topic. If you have a hyphenated word, do you split it at the hyphen? If you have an apostrophe S, is apostrophe S a separate word? So it depends on the tokenizer that you use. And you have different choices with the tools for that as well. Punctuation: do you want it in or out? Capitalization: capitalization matters for some things, like if you're looking for proper nouns. In a lot of languages, say, English for sure,
00:17:50
Speaker
capitalization matters a lot. So yeah, you've got a lot of choices about what you want to do when you turn a text into something that can be analyzed, and you have to start from those basics.
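The tokenizer choices described here (hyphens, apostrophe S, punctuation, capitalization) can be made concrete with a small Python sketch. It is only an illustration using the standard library; the function `tokenize` and its flag names are hypothetical, and real tokenizers in NLP libraries make these decisions in more sophisticated ways.

```python
import re

def tokenize(text, lowercase=True, split_hyphens=False, split_possessive=False):
    """Split text into word tokens; each flag is one of the choices discussed above."""
    if lowercase:
        # Lowercasing loses the capitalization cue for proper nouns.
        text = text.lower()
    if split_hyphens:
        # Treat "well-known" as two words instead of one.
        text = text.replace("-", " ")
    if split_possessive:
        # Detach the possessive so "Miami's" becomes "Miami" and "'s".
        text = re.sub(r"'s\b", " 's", text)
    # Keep runs of letters/digits, allowing inner hyphens or apostrophes
    # and an optional leading apostrophe; other punctuation is dropped.
    return re.findall(r"'?[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*", text)

sample = "Miami's well-known class."
print(tokenize(sample))
print(tokenize(sample, lowercase=False, split_hyphens=True, split_possessive=True))
```

The same sentence yields three tokens under one set of choices and five under another, which is why word counts from different tools do not always agree.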
00:18:04
Speaker
So qualitative data generally, you have kind of the same problem. You should probably give me some examples of what you mean, but I think of qualitative data as things like interviews. Yeah, exactly. Yeah, you've got interviews with some subjects and you need to figure out how to analyze it. So there are all sorts of coding tools out there now. There weren't when I was in grad school, it was sort of a manual process, but now there are all sorts of online coding tools where you can sort of highlight and categorize things. And it's usually a laborious,
00:18:33
Speaker
intensive process where some kind of subject matter expert goes through those interviews and labels and codes things. And then you get into issues like intercoder reliability. Is it just my judgment when I coded this statement as being about whatever? Do other people code it the same way? There's a lot of noise and fuzziness in that.
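One common way to put a number on intercoder reliability is Cohen's kappa, which compares how often two coders agree with how often they would agree by chance. The sketch below is an illustration of that statistic, not a method named in the conversation; the function name `cohens_kappa` and the sample codes are made up.

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders who labeled the same items in the same order."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(codes_a)
    # Observed agreement: share of items where the two coders chose the same code.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Expected agreement by chance, from each coder's own code frequencies.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical code throughout
    return (observed - expected) / (1 - expected)

# Made-up codes assigned by two coders to four interview statements.
coder_1 = ["cost", "cost", "access", "access"]
coder_2 = ["cost", "access", "access", "access"]
print(cohens_kappa(coder_1, coder_2))
```

A kappa near 1 means the coders agree well beyond chance; a value near 0 means their agreement is about what chance alone would produce.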
00:18:57
Speaker
But you need to be able to count something and standardize on something in order to do any kind of data analysis with it. So that's usually where you're, you know, where you're starting from with qualitative data. What were you thinking about? No, that's exactly right. And it is sort of these, you know, we ran a survey, you know, we were doing some evaluate, you know, from the sort of folks, the colleagues that I work with here.
00:19:19
Speaker
They run some sort of evaluation. They're talking to some group that's working with some population in, I don't know, Denver, wherever. And what's your experience with this program or with this government program or participating in this work support program? What are your experiences? And it's sort of funny, as you were talking, from sort of social and public policy research, what people are now sort of talking about more and more is big data and getting data from sort of different and newer sources. Twitter is like one of the big ones people talk about.
00:19:49
Speaker
And getting people to understand how to work with those data is also requiring new tools. But then you talk about qualitative data, which is sort of like we've been asking people questions for 100 years and trying to figure out how to better understand the answers to those. And sometimes you can't just put those answers into buckets.
00:20:07
Speaker
And yet now we have technology that, I don't know, is perhaps making those tasks of analyzing those data easier and more efficient. But far too often, I have already seen that the latter part of 2014, researchers trying to use big or new sorts of data, and they're having a lot of these intercoder reliability issues or not using the right tools or using different libraries, and they're not exactly sure of the tools and the libraries that they should be using.
00:20:33
Speaker
And you're seeing, I think, lots of mistakes being made with new forms of data, because it's exciting to have new forms of data, but not exactly using the right tools and techniques. Yeah, a lot of what you're describing sounds like it's ripe for machine learning for, say, categorization classification. Yeah, the problem there is you need training data, and so that's where you get into the coding problem.
00:20:59
Speaker
I mean, you know, there's an entire category of Mechanical Turk tasks for sentiment assignment, so you can get,
00:21:06
Speaker
whatever your small text is, if it's a tweet or if it's a review or whatever, you can get Mechanical Turkers to do sentiment ratings for you so that you can build a classifier to give you sentiment ratings on new text of the same type, right? Yeah. A common, common problem.
Ongoing Projects and Resources for Text Visualization
00:21:20
Speaker
There is noise in all of these things for sure. As soon as you're talking about anything with machine learning, you've got some error that you're going to have to cope with and garbage in, garbage out too. So you have to make sure that your ratings are good. Your classification is good in the training set.
00:21:36
Speaker
And when it comes to visualizing text, what does that usually entail? Is it trying to take qualitative data or trying to take text and quantifying it in some way?
00:21:46
Speaker
Yeah, it depends what you want to visualize. I mean, there are artistic text visualizations, too. And those are things as simple as: look at the structure of this. Is it a lot of white space or is it text? You know, they give you sort of an impressionistic opinion of something. There are calligrams, which are poems, sort of text artwork, where the text is shaped like an animal or a person or a vehicle. So those are like the artistic text visualization side.
00:22:16
Speaker
Yeah, in the more statistical realm, it's definitely statistical graphs counting things and trying to figure out patterns in the content. So topic modeling is an exploratory method where it tries to figure out if there are groups of themes.
00:22:34
Speaker
in the documents and which documents have those themes. And again, all it's got to work with there is numbers of words, right? So all you get out is like, this is the set of words for this topic in this group of documents. So at some point you have to count words in almost all of these methods. It's essentially: turn a document into words and then visualize that in some way.
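The "turn a document into words and count them" step that underlies these methods can be sketched briefly in Python. This shows only the counting stage that feeds a chart or a topic model, not topic modeling itself; the function names `term_counts` and `top_terms` and the two sample documents are hypothetical.

```python
from collections import Counter

def term_counts(documents):
    """Map each document name to a Counter of lowercase word frequencies."""
    return {name: Counter(text.lower().split()) for name, text in documents.items()}

def top_terms(counts, n=3):
    """Return the n most frequent words in one document's Counter."""
    return [word for word, _ in counts.most_common(n)]

# Made-up miniature documents, already stripped of punctuation.
docs = {
    "report_a": "Water water health",
    "report_b": "School school school water",
}
counts = term_counts(docs)
for name, counter in counts.items():
    print(name, top_terms(counter))
```

The resulting per-document counts are the kind of table that a bar chart, a word cloud, or a topic model would take as input.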
00:22:59
Speaker
And unfortunately, word clouds, which you hate. And I'm a little more forgiving of them. Yeah, I'm not a big fan. Well, they're an odd case, because they're between the data art and the useful. And so, yeah. Yeah. Interesting. And are you still continuing to do some work on text vis now, or are you sort of on hold as you're gearing up to teach this class?
00:23:26
Speaker
No, I just kicked off, well I am kicking off next week, a project on sort of tutorial materials for people who aren't expert programmers to do text visualization of various kinds. It's kind of aimed a little bit at digital humanities people and maybe journalists who are not necessarily strong NLP people.
00:23:47
Speaker
So they'll be able to take the materials, it's going to be an online tutorial website, and try out different techniques using some simple scripts if they want, or hopefully just web browser examples. They'll get better results if they use the script, the Python and R scripts, and then plug that into the web pages.
00:24:10
Speaker
Depending on their level of comfort with running Python or R, hopefully we'll have some just web-based tools for them. I'm doing that with friends at Bocoup, actually. The funding I have as a Knight Fellow from the Knight Foundation allows me to work on some projects. That's the project I picked because it's the subject I'm personally most interested in.
00:24:32
Speaker
Great. Well, I think people will be looking forward to that. And once it comes out, I'll put that in the show notes so people can check it out. Lynn, thanks for coming on the show. This has been really interesting. And good luck this semester.
00:24:43
Speaker
Yep. Thank you. I'll need it.
Conclusion and Listener Engagement
00:24:46
Speaker
And thanks to everyone for tuning in. I hope everyone had a great break over the holiday season and is getting a good kickoff to 2016. Again, if you have any comments or suggestions, please leave a note on the website and please rate the show on iTunes. It gets the show moved up the queue so that others can find out about the show and listen to it. So until next time, this has been the PolicyViz Podcast. Thanks so much for listening.
00:25:21
Speaker
This episode of the PolicyViz podcast is brought to you by Juice Analytics. For 10 years, Juice has been helping clients like Aetna, the Virginia Chamber of Commerce, Notre Dame University, and US News and World Report create beautiful, easy to understand visualizations. Be sure to learn more about Juicebox, a new kind of platform for presenting data at juiceanalytics.com. And be sure to check out their book, Data Fluency, now available on Amazon.