
PolicyViz Podcast Episode #12: Scott Klein

The PolicyViz Podcast

Welcome to Episode #12 of the PolicyViz Podcast! This week, I'm excited to welcome Scott Klein, Assistant Managing Editor at ProPublica. Scott and his team do really exciting work with data, presenting large datasets as journalism. That is, they...

Transcript

Introduction and Guest Background

00:00:11
Speaker
Welcome back to the PolicyViz podcast. I'm your host, Jon Schwabish. I am here with Scott Klein, the assistant managing editor of ProPublica, sort of my latest guest in this burgeoning field of data journalism. Scott, welcome to the show. Great to be here. Thanks for coming on. I want to start by talking about your philosophy, ProPublica's philosophy, when it comes to the bigger stories

ProPublica's Investigative Mission

00:00:36
Speaker
that you guys do. It seems like you do a lot of work
00:00:39
Speaker
pulling out data sets that are maybe buried or don't get as much publicity as maybe they deserve and ProPublica seems to put them in these usable, friendly formats and you also have sort of a different take on those formats as far as I can tell.
00:00:54
Speaker
Sure. Well, I mean, you know, briefly, ProPublica is a nonprofit investigative news outlet started and based in New York City; we're, I think, about 50 people, plus or minus. And our philosophy as an entire organization, our mission as a charitable nonprofit, is to do investigative journalism that has a real impact in the world. So, you know, not just raising awareness, but in fact,

Innovative Data Storytelling

00:01:23
Speaker
making change. And we focus on stories with, we like to sort of say, moral force, and stories that other news outlets maybe don't have the time and resources to do in the way that they maybe used to in the happy days of huge profits. So one of the ways that that translates, you know, there's a wider newsroom.
00:01:45
Speaker
I run a small team within the larger newsroom, but within the larger newsroom we have people working on long-form investigative stories that can take a year or longer to report. So we kind of really like going after big game. The way that plays out in my team is sort of similar. We like taking large data sets, maybe ones that are confusing or that people don't understand, or
00:02:12
Speaker
maybe especially ones that actually haven't been released as open data, but that we get through FOIA requests or through sourcing or through other means, and put them together with other data sets that we think will help people understand the data even more. And then what's, I think, a little bit new about what we're doing is we actually are presenting the data itself to people as the journalism.

Personalizing Data Narratives

00:02:33
Speaker
So, you know, the traditional model of data journalism is to take data sets
00:02:38
Speaker
to do, you know, even sometimes very lengthy work to process and analyze and clean the data, and then write a story about the data and say, you know, we looked at this million-row dataset and here is the biggest, here is the smallest, here is the average,
00:02:56
Speaker
you know, what you need to know, and, you know, here are some anecdotes that help you understand this data. And this kind of takes the form of a narrative story or a broadcast story in a very traditional model. And what we do on my team is actually take that the next step and actually just show people the data, and we tell stories not with narrative but with user interface, with
00:03:23
Speaker
hierarchy, with design, so that people are telling themselves a story. Just as they would read a story in a newspaper, they're telling themselves a story, but this time not with anecdotes but with their own information.
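To make the contrast between the two models concrete, here is a minimal Python sketch using pandas. The school names, column names, and numbers are invented for illustration; this is not ProPublica's actual data or code.

```python
import pandas as pd

# Hypothetical data: one row per school (names and numbers are made up).
schools = pd.DataFrame({
    "school": ["Lincoln High", "Washington High", "Roosevelt High"],
    "ap_courses": [22, 4, 11],
})

# Traditional model: analyze the dataset, then write one narrative summary.
print("Most AP courses:", schools.loc[schools["ap_courses"].idxmax(), "school"])
print("Fewest AP courses:", schools.loc[schools["ap_courses"].idxmin(), "school"])
print("Average AP courses:", schools["ap_courses"].mean())

# Data-as-journalism model: the interface lets every reader pull up
# their own row, so the "anecdote" is the reader's own school.
def lookup(school_name: str) -> pd.Series:
    return schools.set_index("school").loc[school_name]

print(lookup("Roosevelt High"))
```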

Journalism and Data Interfaces

00:03:37
Speaker
So, you know, their school, their town, their politician, you know, how did these things relate to the whole picture?
00:03:46
Speaker
Yeah, and how important do you think it is that people are able to compare their personal experience to the experience of maybe their friends or their family members? So I can go in and look at, you did one on education. I can go in and look at my high school from when I grew up. How important is it for people to be able to say, I can look at my high school and I can also look at my spouse's high school when he or she was growing up?
00:04:09
Speaker
Right. It's a great question, and in fact it's absolutely central to the way we do things. So the project you're talking about is called The Opportunity Gap. It's a project we did a few years ago that looks at Department of Education data about educational opportunities, and this is defined as
00:04:26
Speaker
higher math classes, chemistry classes, AP courses, and things like that. And it asks the question, you know, how are these resources distributed? Are they distributed fairly to rich kids and poor kids? And in the old model, you could have told that story by, you know, picking three high schools in the country that
00:04:46
Speaker
display certain characteristics that you as a reporter find useful: this one is one where there's a lot of equality, here's one where there's not a lot of equality, let's look at their outcomes, and things like that. And if you're lucky, and very, very lucky, as a reader, one of those schools might be meaningful to you. But in the vast majority of cases,
00:05:05
Speaker
you wouldn't have ever heard of these high schools, and they may not be great illustrations of this phenomenon to you, because it's sort of hard for you to... you're trying to understand this very complex phenomenon with the example of something that, even though simple, is quite remote.
00:05:23
Speaker
And what we can do with a project like the Opportunity Gap is to explain this large phenomenon to you by first showing you as a benchmark your own experience. So you can look up the high school that you went to and understand that, you know, I know what this high school looks like, I know what it was like to be at this high school, or it's your kids' school, and you know what the student population is like, and you know kind of how the place feels, and you can say,
00:05:48
Speaker
All right, well, if that is a school that has twenty-five percent free and reduced-price lunch, which is our proxy for poverty, I think now I can sort of set that as my benchmark. And for a school that has more free and reduced-price lunch, I know what it means to be a higher-poverty school, what it feels like to be a higher-poverty school; and one that's lower, that it's a lower-poverty, more wealthy school. So
00:06:09
Speaker
it now becomes something that I can attach to my own experience. And then when I show you through the interface design, the higher and lower poverty schools and how the outcomes are different and how the distribution of resources is different, it's much more meaningful to me and much more visceral to me. And I don't need to sort of struggle to find analogies. I have one built in that's really close to me.
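As a rough illustration of that benchmark idea, here is a small Python sketch. The school names, free and reduced-price lunch figures, and course counts are all hypothetical.

```python
import pandas as pd

# Hypothetical schools; the share of students on free or reduced-price
# lunch is the poverty proxy Scott describes.
schools = pd.DataFrame({
    "school": ["Your High School", "Eastside High", "Hillcrest High"],
    "pct_free_lunch": [25.0, 61.0, 9.0],
    "ap_courses": [14, 5, 23],
})

# The reader's own school becomes the benchmark that every other
# school is framed against.
benchmark = schools.iloc[0]

for _, row in schools.iloc[1:].iterrows():
    direction = "higher" if row["pct_free_lunch"] > benchmark["pct_free_lunch"] else "lower"
    print(f"{row['school']}: {direction}-poverty than yours; "
          f"{row['ap_courses']} AP courses vs. your {benchmark['ap_courses']}")
```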
00:06:30
Speaker
And how important is it for you and your team to pair the data portal, this ability to look at my own experience, with the stories, with actually talking to people and sort of building in the narrative story around the interface and around the tool?

Challenges with Data Formats

00:06:47
Speaker
No, it's crucial, and one thing that I want to make sure that I say here is that we absolutely are journalists on the team, and we have phones next to our computers and we use them to call people, and we gather sources and we talk to experts, and a lot of the time we participate in actually the writing of the story. We just published a story, for instance, about
00:07:12
Speaker
cruise ship safety in which the reporter was the same person as the developer and the designer. So we had, in one person, somebody doing all of the kind of traditional journalism. We ended up writing a story that we threaded through a
00:07:28
Speaker
user interface that let people look up individual cruise ships to see what their safety record was like. So we had the ability to kind of explain this phenomenon to people and show them examples, maybe a cruise they took or that their parents took or a friend took, all at the same time. So it's crucial to us that we have the ability to do what traditional reporters do.

Analyzing Complex Data

00:07:52
Speaker
Right, I want to ask you a little bit about the actual data collection process. My sort of view of the open data movement is we've moved from a place where we have data wrapped in PDFs that are basically impossible to get. We've now moved to APIs and JSON formats where we have machine-readable data, and we have other products and platforms out there that are creating data that we can sort of get in this machine-readable way. But now I wonder whether we need to move from a machine-readable
00:08:21
Speaker
format, where you have variable names, to a human-readable format, where instead of, you know, V1295 it is, you know, state or location or some... it's the variable name. What sort of difficulties, what sort of challenges do you guys have when you're working with data of these different formats coming out of these different agencies or organizations? You know, first of all, we get a lot of data that has, you know, V129, and, you know...
00:08:51
Speaker
It's a thing that we confront with almost every project. And, you know, we are happy and lucky if the data comes with a sufficient data dictionary that helps us understand it. And frankly, lots of government data, either via FOIA or via open data portals, actually does come with pretty decent
00:09:10
Speaker
data dictionaries, at least in my experience. That may be clouded by the fact that if they don't come with good data dictionaries, we tend to either not do it or take a long time trying to get one. So I definitely see the issue. I think there's lots of complexity and lots of competing priorities inside the government in terms of who the audience is for this data.
00:09:37
Speaker
I, for one, don't welcome that day, because it'll maybe put me out of business, but I'm not all that worried. Well, you still need to pull it out of these formats and get it into something where people can use it, right?
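For what that step looks like in practice, here is a tiny Python sketch of using a data dictionary to turn V129-style variable names into human-readable ones. The field names and the mapping are invented for illustration.

```python
import pandas as pd

# Hypothetical raw extract with opaque variable names, plus the kind of
# data dictionary a good agency release ships alongside it.
raw = pd.DataFrame({"V129": ["NY", "TX"], "V130": [812, 455]})
data_dictionary = {"V129": "state", "V130": "enrollment"}

# One rename makes the machine-readable file human-readable; the hard
# part, as Scott notes, is getting a good dictionary in the first place.
readable = raw.rename(columns=data_dictionary)
print(readable)
```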

Reporters as Researchers

00:09:48
Speaker
So there's that, which is what you guys sort of... That's the core there at ProPublica, right?
00:09:53
Speaker
Yeah, no, absolutely. But I think also data is complex. And often data is shoehorned into kind of forms and formats that are necessarily insufficient to express all of the data's complexity. We did a project about the Form 990, the nonprofit tax return form.
00:10:17
Speaker
And we really kind of dove into this project thinking, oh, for sure there's going to be some heuristic we can use to detect corruption, and this will just be kind of an ocean of nonprofit corruption stories. And surely there are corrupt nonprofits, and surely there has been a bunch of great coverage of corrupt nonprofits. But it is not as though there is, as I say, some pattern that you can detect
00:10:43
Speaker
just based on the Form 990 and sort of say, well, you know, look for people who spend too much money, or look for people who run deficits consistently. You know, the Form 990 is a very rigid thing, and, you know, it's hard to kind of express the complexity of, you know, nonprofits just by asking, you know, whatever, 45, 50 questions, because everybody gets asked the same 45 or 50 questions and they have to kind of answer as best they can. So a lot of times, when we kind of
00:11:12
Speaker
built a model and said, look for people who do this or that, you would kind of get out nonprofits that were actually just fine, and they follow that pattern for perfectly legitimate reasons. So the data is very, very complex. It's not just that our labeling has to get better. It's that you kind of need to do the work of understanding the data and talking with the people who built the data, and hopefully that's
00:11:40
Speaker
one of the values that we can bring as journalists.
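To see why such a heuristic misfires, consider a sketch like the following. The Form 990 fields, thresholds, and organizations here are invented, not an actual ProPublica model.

```python
import pandas as pd

# Hypothetical Form 990 summaries (all names, fields, and figures invented).
filings = pd.DataFrame({
    "nonprofit": ["Org A", "Org B", "Org C"],
    "consecutive_deficit_years": [3, 0, 4],
    "overhead_share": [0.45, 0.12, 0.38],
})

# The naive heuristic Scott describes: flag consistent deficits or
# "too much" overhead. Perfectly legitimate nonprofits (say, one in a
# planned multi-year building campaign) match the same pattern, so the
# flags are leads to report out by phone, not findings.
flagged = filings[
    (filings["consecutive_deficit_years"] >= 3) | (filings["overhead_share"] > 0.4)
]
print(flagged["nonprofit"].tolist())
```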

Training and Expert Collaboration

00:11:42
Speaker
So this data complexity really leads me to thinking about reporters as researchers, which is, as you probably know, one of my comments about the sort of data journalism field. I talked to Ben Casselman in one of the early episodes of the podcast.
00:11:59
Speaker
There are a lot of reporters, I think, or journalists who may not have the experience or the training in statistics or econometrics or what have you. So how do you play with the balance of maybe teaching your staff how to do this research effectively, or how to work with statistics, how to work with data? Do you have a training process that you use? Is it communicating with experts?
00:12:25
Speaker
Where do you see both your team's role at ProPublica and also sort of the wider field of journalism and media, now and as we sort of go forward? I mean, it's a little bit of all of the above, but the most important thing, and this is kind of what differentiates journalists from
00:12:41
Speaker
other disciplines like it, is that we always rely on the wisdom of smarter people. And it is crucial that we talk to somebody, even if we think we understand a data set perfectly, even if we think that we understand the math we've done perfectly, we will always call someone that we think can tell us we're wrong and show them before we've published what we've done, how we've done it, what our assumptions were,
00:13:09
Speaker
We'll show them our code and say, here's what we did. How did I screw this up? And like as not, they'll say, oh my god, you can't do that, you're comparing state over state, you can't do that. And I think it's absolutely crucial to our ability to get things right, which is really the most important thing we care about.

Conclusion and Key Takeaways

00:13:31
Speaker
So sure, we've got folks who were
00:13:35
Speaker
economics majors. We have somebody here who went through the SIPA program at Columbia and got her stats chops from there. So we have folks who are trained and who train others. But even, you know, anything past a sort of simple regression, and often even a simple regression, there's always an attached phone call to an expert who can explain why the model we built has some flaw or some confound that we're just not paying attention to.
00:14:03
Speaker
And are you finding that those experts are happy and willing to lend their expertise and comment? Yes, absolutely.
00:14:11
Speaker
I mean, it seems like it's sort of a peer review process. That peer review process obviously has its flaws in the existing sort of academic world, but it seems like it's an important aspect of this sort of data-driven journalism that you and others are doing. Yeah, and I think that the times, the few times, thank goodness, we've made mistakes, and the times that I've seen other news organizations make a mistake,
00:14:37
Speaker
it's been because we thought we knew what we were doing: I don't have to check, I read the docs, I understand this data pretty well. There was just some simple misunderstanding you had, or you'd baked into your model an assumption that turns out to be flawed, and a phone call would have ironed it out.
00:14:58
Speaker
Right. And do you ever feel like you're sort of held back a little bit because, you know, maybe you don't have a statistician or econometrician, like, sitting on staff? Do you ever feel like, oh, we'd really want to dive into this complex issue, but we don't really have the skill set to do that, so either we're not going to do it, or we're going to go find some people who could maybe do the research for us?
00:15:21
Speaker
You know, I always want more people. Sure. I mean, if I could have, you know, a staff twice the size, I would say yes right now. But, you know, I think even then we would still rely on outside experts, because even if I had a PhD statistician and, you know, a room full of data scientists, I also would need hydrologists and, you know, as you say, econometricians and, you know,
00:15:49
Speaker
experts in bookkeeping, and I would need so many kinds of expertise that I couldn't possibly employ them all. So we're still, as journalists... you know, there's a difference between a journalist and a witness, right? We were not there when the thing happened. We weren't in the room when the corruption was decided. We weren't there; we didn't see it. So we are always kind of overcoming the fact that we're telling a story that we didn't see ourselves, and
00:16:17
Speaker
And so, we're always relying on others, no matter how much expertise we have. But, you know, absolutely. And we work very frequently with, in fact, in every single project, we're working with someone who knows more than we do about a subject.
00:16:30
Speaker
Yeah, well, this is really interesting. I'm a big fan of the work you guys do. Thanks. And we'll keep exploring all the cool data sets you guys pull out from somewhere and put in a better format. Well, thanks, Scott, for coming on the show. I appreciate it. Thanks for having me. And thanks to everyone for listening. I am Jon Schwabish, and this has been the PolicyViz Podcast. I will see you next week.