Introduction of Nick Diakopoulos
00:00:11
Speaker
Welcome back to the PolicyViz Podcast. I am your host, Jon Schwabish. I am very excited for my new guest, Nick Diakopoulos, professor at the University of Maryland's College of Journalism. Nick, welcome to the show.
00:00:25
Speaker
Hi John, thanks for having me on today.
Workshop on Algorithmic Transparency at Columbia
00:00:28
Speaker
Great to have you. I wanted to have you on so we could talk about this recent workshop you put on at Columbia University, which you were nice enough to invite me to, because it was a really interesting day. So we met up in New York. This was in, what, April now, I think we're talking?
00:00:46
Speaker
And we were talking about working on algorithmic transparency in the media. So maybe you could give us a little summary of the workshop. I think you'd probably do it better justice than I would.
Media's Use of Algorithms and the Need for Standards
00:01:00
Speaker
Yeah, so basically this was a plan that I hatched with the Tow Center for Digital Journalism at Columbia to kind of jumpstart the conversation around how the media needs to start thinking about
00:01:16
Speaker
their use of algorithms and, in particular, about how they could be more transparent about their use of algorithms, and sort of digging into specific case studies. So, you know, people are using what they call robot journalists now, so automatic content-generation algorithms. News organizations are using simulation and modeling in storytelling.
00:01:40
Speaker
They're also using algorithmically enhanced curation. We wanted to drill into these different case studies, get the participants talking specifically about what could be made transparent about algorithms in each of those cases, why you would make it transparent, what the benefits would be, what the
00:02:04
Speaker
cost would be and try to sort of come away with a little bit of a prototype standard or some framework for thinking about what are the kind of dimensions that you would want to be transparent about as you were publishing information using algorithms.
Broader Issues of Algorithmic Transparency
00:02:25
Speaker
Right, it was a fascinating day not only because from my perspective
00:02:30
Speaker
You know, when I went up there, I sort of thought, okay, so the real issue here is
00:02:35
Speaker
Some news organization is going to do some data analysis, some regression model, and they should be posting those data. They should be posting the methods by which they're analyzing the data. But we actually sort of worked our way into lots of other topics and ideas, right? Should the algorithms that are used to move comments around at the bottom of a page, should those algorithms be made more transparent?
00:03:01
Speaker
The ads that pop up, should those be made transparent? I mean, there's a pretty wide range of things to think about here.
00:03:10
Speaker
Yeah, there's a lot of stuff to think about here. I mean, news organizations are using personalization now, so how are they using your personal data to adapt content or serve ads? How are they making inferences, so that they're classifying you, or they're classifying things, in
00:03:32
Speaker
investigations? And, of course, classifiers have error rates, and you want to, you know, let the reader essentially know
00:03:41
Speaker
how sure you are of your analysis. So I think there are lots of different ways in which algorithms are being used now. One example that I like a lot, actually, is the New York Times Fourth Down Bot.
Transparency through Visualization: NYT's Fourth Down Bot
00:03:58
Speaker
I don't know if you saw this. So it's a sports bot running a model of American football where
00:04:09
Speaker
They're basically trying to predict what a given team should do on any given fourth-down play. Should they punt, should they go for it, and so on. It's all very data driven, and they have a well-reasoned model behind it. It's surfaced on Twitter as this bot. What I like about this, actually, is that
00:04:34
Speaker
they have a whole page online (the New York Times runs this bot) that actually visualizes kind of the biases in the model. And they actually show you, you know, they sort of visualize the field and show you what the bot would predict you should do at any given location on the field on a fourth down. And they kind of compare that to the actual data of what
00:05:01
Speaker
coaches actually do at that position on the field. It's just, I think, a nice example which highlights the potential for visualization to help with transparency. So when you do have models that are running these things in the media, we can visualize those models in ways that help
Balancing Transparency and Competitive Advantage
00:05:19
Speaker
the end user, the consumer, understand the biases in these models or the error rates in these models. Right. Now, how do you view... so, in that case, the NFL data are essentially public. You know, some of it you can buy, but essentially it's public. So how do you view news stories where the data is either
00:05:41
Speaker
something the news organizations are collecting themselves, or they're purchasing the data, or they're creating a model where the model itself has monetary value. So do you see a line between where they should and should not release the algorithms and release the data? Or is it sort of: whenever a media organization is doing some analysis with data, they should be publishing everything?
00:06:09
Speaker
Well, data is a tricky thing, right? I mean, because people buy data all the time, or there's potentially private information that might be in data sets that you wouldn't want to make public. So it's not all straightforward. And I don't think it's as easy as just saying, well, they should always publish the data behind their models. I mean, I think there are some caveats.
00:06:34
Speaker
In general, we want to know what is the quality of that data? Is it accurate? Is it complete? How is it collected? What's the methodology behind that collection? If you're training a model based on data, well, how much data did you train it on? Is it enough data that we can be confident in your model? Are there any other assumptions in the way that your data was collected?
00:07:00
Speaker
How was your data processed? Did you have to clean it? Did you edit it in some way? There's all kinds of decisions that get made along the way of how data is transformed that you might want to be transparent about. I think oftentimes the sort of counter argument for transparency is
00:07:23
Speaker
well, it ruins our competitive advantage, and if we're transparent about this then we leave ourselves open to manipulation. And at times that's legitimate, right? I mean, if you have a very valuable data set that you spent a lot of time collecting, you might choose to protect that and not publish it
00:07:52
Speaker
in a way that other people could pick it up and use it, so as to maintain that as a competitive advantage. But you might still disclose, like I was saying, certain dimensions of the data: the quality, its accuracy, how it's processed, whether or not it includes private information, stuff like that.
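To make the kind of disclosure Nick describes concrete, here is a minimal sketch, in Python, of what a machine-readable disclosure published alongside a data-driven story might look like. The field names and values are entirely hypothetical; this reflects no actual newsroom standard, only the dimensions mentioned in the conversation (collection method, training size, completeness, processing steps, privacy):

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class DataDisclosure:
    """Hypothetical machine-readable disclosure of the dimensions
    discussed above: how data was collected, cleaned, and used."""
    source: str                   # where the data came from
    collection_method: str        # how it was gathered
    n_records: int                # how much data the model saw
    known_gaps: list = field(default_factory=list)        # completeness caveats
    processing_steps: list = field(default_factory=list)  # cleaning/edits applied
    contains_private_info: bool = False

# Invented example values for illustration only.
disclosure = DataDisclosure(
    source="purchased play-by-play feed (raw data not republished)",
    collection_method="vendor API, 2009-2013 seasons",
    n_records=125_000,
    known_gaps=["preseason games excluded"],
    processing_steps=["deduplicated plays", "imputed missing field position"],
)

# Publish the caveats without publishing the proprietary data itself.
print(json.dumps(asdict(disclosure), indent=2))
```

The point of the sketch is that a newsroom can disclose the *shape* and provenance of a data set, exactly the compromise Nick suggests, while the data itself stays private.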
00:08:11
Speaker
And do you think, for the most part, that news organizations have the ability?
Where Should Transparency Information Be Stored?
00:08:15
Speaker
I mean, they certainly have the ability. But should these notes and caveats and descriptions, should those be, for the most part, housed
00:08:26
Speaker
not with the article itself, I would assume, but on, like, a separate platform or in a blog? I know the New York Times has some development blogs; so does the Washington Post. So is that the sort of place where things should be? Or should things all be housed together? FiveThirtyEight, for example, has a GitHub site where you can go use some of the data that they've used. Yeah, I'm a big fan of
00:08:51
Speaker
kind of providing some interface affordance or some hook in the interface that people will see that's kind of salient. So that if they are interested in seeing sort of the behind the scenes work of what fed into this model,
00:09:08
Speaker
the data and the code and so on, they can sort of click into it and dive into it. And I think there are some really fundamental human-computer interaction challenges here, right? I mean, you don't want to overload the reader, the end user, right? I mean, you don't want cognitive overload. You don't want people feeling like, oh my god, there's all this stuff going on, and I have to understand everything, and why are they showing me all this stuff? I just want to read this article.
00:09:36
Speaker
But I think having some kind of way of sort of presenting information in a multi-scale fashion where at the surface level you can just read the article or the content, but with some kind of salient hooks to be able to drill in and say, okay, now I can sort of see the layer behind this.
00:10:01
Speaker
I don't necessarily have a strong feeling of whether or not that needs to be on another platform like GitHub or if it should be hosted by the media organization itself. I mean, the way I see most organizations doing this
Essential Skills for Data Journalists
00:10:14
Speaker
now is they'll throw it up on GitHub as part of their
00:10:19
Speaker
workflow. And that seems to make sense. And there's other advantages to that as well, like just version control of your code and your data and so on. And of course, having the historical chain there for how a project evolved could also be an interesting thing to have access to.
00:10:42
Speaker
So I want to switch just a little bit, because I'm curious about your thoughts since you teach journalists. I'm curious about your thoughts on the staffing and the training that modern journalists need, because one thing that came up at the Columbia workshop was this concept of journalists as researchers, and I've written on this a little bit. You know, sometimes it makes me nervous when I hear
00:11:09
Speaker
journalists say, well, you know, or journalism programs say, for example, oh, you know, our students now take two statistics classes so they know how to run a regression. And that always makes me cringe, because it's always like: they may know how to run a regression, but do they know how to identify a good regression versus a bad approach? So I'm curious, when you have
00:11:32
Speaker
the sort of new media where there's more data-driven analytics and they're running regressions, what is your view of the skill set that new journalists need, both to be responsible when they're working with data and then to take the next step of making it more transparent, so people can evaluate it and analyze the quality of the work?
00:11:54
Speaker
Yeah, I mean, that's a great question. And it's not an easy one. I mean, I think we're basically looking at data journalism as a new form of public social science. And you're having data journalists do original analyses and publish those original analyses, oftentimes without the benefit of peer review.
00:12:20
Speaker
Although, you know, data journalists do lean on each other internally to a news organization, or they lean on an editor to evaluate analysis, or in some cases they'll call, you know, a statistician friend, or they'll call a, you know, a statistics professor and say, hey, you know, is this crazy? You know, these are my results.
00:12:44
Speaker
and just kind of do a sanity check on it. Probably not as rigorous as a peer review kind of thing. So there are kind of models, I think, that exist for making this kind of work.
00:12:59
Speaker
viable in public, you know, outside of the traditional peer review model. Now, in terms of what that means for skills, I mean, I think, you know, we are looking at, you know, more statistics, more programming, so being able to move data around
00:13:16
Speaker
being able to set up a machine learning classifier in Python, for instance, being able to churn through large sets of documents or use APIs to apply other kinds of analyses. I think these are all important skills. Combine that now with what journalists have traditionally done well, which is communicate information,
00:13:43
Speaker
And there's a whole new set of skills, I think, that need to be developed in terms of the effective communication of data-driven investigations. So what's a good way to present a news app? It doesn't always need to be a written story. Sometimes it makes a lot more sense as a written story plus some charts, or maybe just as a data utility, where there's really not much text at all.
Innovations in Data Transparency
00:14:10
Speaker
It's more about
00:14:11
Speaker
letting the end user explore a dataset interactively or explore a dataset through visualizations and so on. So, I mean, I think there's sort of, in addition to the core journalism skills of reporting, writing, and editing, that writing part is, I think, being generalized out to communicating with data, which I think, more often than not, kind of comes down to data visualization, right?
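As a toy illustration of the "classifiers have error rates" point from earlier in the conversation, the sketch below trains a trivial nearest-centroid classifier on invented two-dimensional data and reports its held-out accuracy, the kind of number a newsroom could surface to readers. A real pipeline would use a library such as scikit-learn; this pure-Python version is only a sketch under made-up data:

```python
import random

# Toy data: two hypothetical classes of 2-D points
# (stand-ins for document features in a real investigation).
random.seed(0)
data = [((random.gauss(0, 1), random.gauss(0, 1)), "A") for _ in range(100)] + \
       [((random.gauss(3, 1), random.gauss(3, 1)), "B") for _ in range(100)]
random.shuffle(data)
train, test = data[:150], data[150:]

def centroid(points):
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

# "Train": compute one centroid per class from the training split.
centroids = {
    label: centroid([p for p, l in train if l == label])
    for label in {"A", "B"}
}

def predict(point):
    # Assign the point to the nearest class centroid (squared distance).
    return min(centroids, key=lambda l: (point[0] - centroids[l][0]) ** 2
                                        + (point[1] - centroids[l][1]) ** 2)

# Report the held-out error rate -- the number a reader should get to see.
correct = sum(predict(p) == l for p, l in test)
accuracy = correct / len(test)
print(f"held-out accuracy: {accuracy:.0%}")
```

Publishing that held-out figure alongside a classifier-driven story is one concrete way of "letting the reader know how sure you are of your analysis."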
00:14:41
Speaker
What's the effective way to visualize this dataset for an end user? Really interesting, really interesting. So you hosted the workshop or co-hosted the workshop, you have your students, so where are you headed next with this line of discussion and I assume research?
00:14:59
Speaker
Well, yeah, so for me, you know, I have a background, going back to graduate school, in human-computer interaction. And so, in terms of the research agenda, I'm very interested in developing new interaction techniques, new visualization techniques, that help you be more transparent with the algorithms that you're using. So one example of this is a project I'm continuing to work on with the
00:15:25
Speaker
IEEE Spectrum magazine. We published last year an interactive ranking of top programming languages. It's all very data driven, and it's kind of an interesting new methodology for looking at top programming languages.
00:15:43
Speaker
And I'm sort of thinking about, well, how do you let people step into that ranking of top languages? How do you let them tweak and tune and re-weight and create their own ranking from those languages? So I'm sort of continuing to work with them on developing visualization techniques and interface techniques that allow people more transparency and flexibility in rankings that they're interacting with online.
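A minimal sketch of the re-weightable ranking idea Nick describes: the metrics, values, and weights below are hypothetical, not IEEE Spectrum's actual data or methodology. The point is only that exposing the weights lets the reader "tweak and tune" the ranking:

```python
# Hypothetical normalized metrics per language (invented numbers).
metrics = {
    "Python": {"search_hits": 0.90, "job_posts": 0.80, "repo_activity": 0.95},
    "Java":   {"search_hits": 0.85, "job_posts": 0.90, "repo_activity": 0.70},
    "R":      {"search_hits": 0.50, "job_posts": 0.40, "repo_activity": 0.60},
}

def rank(weights):
    """Score each language as a weighted average of its metrics.

    Exposing `weights` to the reader is what makes the ranking
    transparent and re-weightable."""
    total = sum(weights.values())
    scores = {
        name: sum(weights[m] * v for m, v in vals.items()) / total
        for name, vals in metrics.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Equal weighting vs. a reader who cares mostly about job postings:
# with these invented numbers, the job-focused weights move Java ahead.
print(rank({"search_hits": 1.0, "job_posts": 1.0, "repo_activity": 1.0}))
print(rank({"search_hits": 0.2, "job_posts": 1.0, "repo_activity": 0.2}))
```

Wiring sliders in an interface to the `weights` dictionary would give exactly the "step into the ranking" interaction described above.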
00:16:09
Speaker
I have some other theoretical work that I think will build very nicely on top of this workshop, the Columbia Workshop, where we're really trying to think underneath it all, what would a standard look like for algorithmic transparency?
Creating Standards for Algorithmic Transparency
00:16:30
Speaker
What are all the different elements that will factor into that? And if we can articulate that,
00:16:41
Speaker
and get the industry to sort of look at that proto-standard and comment on it and iterate on it. And over time, maybe we can agree and develop almost an industry understanding (I hesitate to say a consortium), but some kind of shared understanding that these are the standards, these are the expectations, ethically or professionally, of using algorithms in the media, and that it be
00:17:11
Speaker
integrated into other kinds of editorial policies. That is really interesting. And you had mentioned earlier the GitHub approach, and there's a workflow question there.
Impact of Data-driven Methods on Media Workflows
00:17:22
Speaker
I think that workflow question, I know lots of organizations are
00:17:26
Speaker
trying to figure out how this sort of new data-driven society in which we live affects their workflow. And I think that conversation, as part of this discussion on transparency, will be really interesting to see, especially as it plays out at different media organizations. Yeah, absolutely. I mean, workflow is essentially what everyone's trying to figure out: how to do this kind of work
Podcast Wrap-up and Farewell
00:17:57
Speaker
and do it efficiently and do it in teams, right? So collaboration workflows and so on. I'm not sure anyone's really solved that yet. Digital organizations have their workflow and they've kind of duct taped it together and it's sort of working to some extent, right? But I think there's lots of room for innovation in that.
00:18:24
Speaker
Well, Nick, this has been terrific. Thanks a lot for coming on. This has been really interesting. All right. Well, thanks for having me, John. And I hope to see you around town very soon. Absolutely. I look forward to the next workshop. Well, thanks, everyone, for listening. If you have questions or comments, please let me know. Hit me up on Twitter or visit the site, policyviz.com. I'm Jon Schwabish, and this has been the PolicyViz Podcast. Thanks so much for listening.