Introduction and Podcast Background
00:00:11
Speaker
Welcome back to the Policy Viz Podcast. I'm your host, John Schwabish. I hope everybody is well. I hope you're safe and I hope you're healthy. I'm excited to be back with more podcasts. I'll be going through about the end of June with some more episodes and then taking a couple months off for the summer, even though there probably won't be a lot of travel going on with the Schwabish household. But I'll be taking a little bit of a rest as we get into the summer months.
00:00:35
Speaker
So as you may know, over the last few weeks, well now months I would say, I've been hosting these Data at Urban digital discussions. They are one hour chats with me and a guest or two. We talk for 10-15 minutes and then a little bit of Q&A with folks who show up. And they've been a great experience for me to talk with people in lots of different fields of data, visualization,
00:01:00
Speaker
graphic design, people who are making tools, people who are working in data science. It's been a really great experience and I've really enjoyed seeing lots of people come out and chat with me, and of course the guests.
Repost Episode and Guests Introduction
00:01:12
Speaker
On this week's episode of the podcast, I'm going to repost one of those discussions. In this digital discussion, I chatted with two of my Urban Institute colleagues, Graham MacDonald and Claire Bowen. Graham is the Chief Data Scientist at Urban, and Claire is the Lead Data Scientist for Privacy and Data Security at the Urban Institute.
00:01:29
Speaker
We talk about their work, we talk about how they're helping researchers at Urban understand the issues around data security and privacy, and we also talk about other issues related to data security, related to data privacy, related to working with administrative data. So it's a really interesting conversation, there's a lot of interesting things happening in this space, and it was great to be able to sit down with Graham and Claire and talk about these various issues.
00:01:54
Speaker
Just a couple of notes before we get into that discussion. I've been trying to post more blogs on my website, sort of some shorter things, not as long as I usually write, just trying to get a little bit more writing done and get a few more things out onto the blog. So lately I've been writing about things like some success I had helping the Social Security Administration improve the data visualizations in their reports.
00:02:20
Speaker
I've written about a visualization I created on the benefits of wearing face masks in the era of COVID-19. And of course I have a new book coming out later this year, Better Data Visualizations. I hope you'll check that out on the show notes page, and you can go pre-order it right now over on Amazon. I'm very excited to see it coming out.
00:02:41
Speaker
So this week's episode of the podcast is my discussion with Graham and Claire, and you'll hear, of course, some other people chiming in with their questions after our discussion. So I hope you'll enjoy this week's episode of the podcast with Graham and Claire Bowen.
00:02:59
Speaker
Good afternoon, everyone. I'm John Schwabish. Thanks so much for coming this afternoon to another Data at Urban digital discussion, digital chat. Hopefully you've been able to tune into some of these in the past. The plan is pretty simple. We have two great guests. Guests, I don't know what to call folks who are showing up for these
Guests' Background and Institute's Work
00:03:17
Speaker
people. Two great folks, two colleagues of mine at the Urban Institute. We're going to chat for like 10 or 15 minutes.
00:03:23
Speaker
about the work they're doing. And then we'll just open it up for questions and have a discussion. It's very casual, very low key. So if you have questions, just pop them into the chat window and I'll be able to build out a queue and then you'll be able to unmute yourself and have a discussion. Right now there's only about 35 of us, so we can just unmute ourselves and have at it. So I thought we would start by just having our two guests introduce themselves. So we have Graham McDonald,
00:03:52
Speaker
from the Urban Institute and Claire Bowen, also from the Urban Institute. And then we'll just, yeah, we'll just take it from there. So again, super laid back. So Claire, do you wanna start? Hi, I'm Claire Bowen, and I'm the lead data scientist at the Urban Institute, and I specialize in data privacy and data security. Am I supposed to say any words? No, that's good. That's good. All right. And just as a point of reference, you're in Santa Fe.
00:04:20
Speaker
Yeah, so right now I'm in Santa Fe, New Mexico. So I was going to be remote working anyway. This totally worked out.
Data Integration and Real-Time Challenges
00:04:32
Speaker
Okay. And then Graham, who is in DC. I'm in DC Metro map back here, a little data. There's no silver line on this one. Um, but yeah, I'm in DC chief data scientist here at the urban Institute, um, started as a housing researcher, did a lot of, uh, data vids right as you were coming on to urban. And then I left John and then came back and I've sort of built up the data science team. We're now seven folks working across all the researchers at urban to integrate
00:05:01
Speaker
anything you could think of called data science into our research services.
00:05:06
Speaker
Maybe we should start there, Graham. You can talk maybe a little bit about the team you've built out and, I don't know, the evolution of what data science means at Urban, and maybe for researchers at places like us in general. And then, Claire, I know you've put a couple things in the chat box that people can take a look at on data security and privacy, so maybe we can just sort of segue into the data privacy issue as well. But maybe
00:05:32
Speaker
I think it'd be interesting for people, Graham, to hear how you've been working to change how Urban uses data every day. Yeah, and let me know, you know, my internet's coming in and out these days, so just let me know if I'm slow and just, I don't know, yell and put your hands up or something. Okay, so yeah, so a little under four years ago now, I came back to Urban from Berkeley after having, you know,
00:05:57
Speaker
gotten my policy degree, but really just taken a bunch of programming courses at the School of Computer Science and Information there. And I really had this vision for Urban where we could use some of these new tools, mostly a lot of cloud technology, but also any of the data science methods, like machine learning, natural language processing, and I would put web scraping in there, that we could use in our research to collect new data, to do new types of analyses, to have
00:06:27
Speaker
better real-time data collection. Because often in our policy world we have data sets where, at the neighborhood level, we're using five-year averages from the American Community Survey, from 2014 to 2018, to talk about today, right? So there are all these ways in which we want to better use data here at Urban. And, you know, John, as you know, before, when I was at Urban, we didn't do much more
00:06:47
Speaker
than just putting out PDFs, right? And I'm sure when you were at CBO, they were doing the same thing, right? You're putting out these PDFs, and you see the graph of who reads a PDF, right? And no one here is going to be surprised that it's sometimes not even your mom. Which is like, really? At least you'd hope to get ten reads, to get your mom; some of them are zero, right? And so building out data science at Urban: you know, when I think about it, building data vis into policy organizations was sort of this obvious next step in how to communicate your work,
00:07:16
Speaker
and it took a lot of effort but was really valuable. I sort of see the same thing in data science at policy organizations today. I mean, don't get me wrong, we still need more data vis at policy organizations, and better data vis. I also think we need more and better data science too. For example, we're doing projects where we use natural language processing on news articles to collect instances of major zoning reform, so we can understand the impact of zoning reform on housing affordability.
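[Editor's note: to make the news-scanning idea concrete, here's a minimal sketch in Python of keyword-based flagging. The terms and articles are invented for illustration; the actual pipeline described here presumably uses trained NLP models rather than a simple keyword list.]

```python
# A minimal sketch of flagging news articles that may describe zoning reform.
# The keyword list and the articles are illustrative, not the real pipeline.
import re

ZONING_TERMS = [
    r"zoning reform", r"upzon\w+", r"single-family zoning",
    r"accessory dwelling unit", r"minimum lot size",
]
PATTERN = re.compile("|".join(ZONING_TERMS), flags=re.IGNORECASE)

def flag_article(text: str) -> bool:
    """Return True if the article text mentions any zoning-reform term."""
    return PATTERN.search(text) is not None

articles = [
    "City council votes to end single-family zoning citywide.",
    "Local team wins championship in overtime thriller.",
]
flagged = [a for a in articles if flag_article(a)]
print(flagged)  # only the zoning story survives for human review
```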
00:07:43
Speaker
Right. Or think about the COVID crisis right now. There are a lot of creative people, including us, thinking about how we understand which neighborhoods or areas are going to be most impacted by this crisis, economically or socially or whatever that may be, in real time, because we don't want to wait three years to get that data. We want to have some proactive response, right?
00:08:03
Speaker
And so I feel like data science is really important in this policy realm, not just for real time data, but also for using these new methods creatively to come at new data sources. And that's where we use sort of like big data methodology, new cloud architectures, APIs, things like that.
Data Privacy and Security Concerns
00:08:19
Speaker
Can you talk a little bit about, I don't want to say the pushback, I guess the right word is challenges, because it's not necessarily pushback, but it is changing the way people who have worked with data or built models for a long time have to work, right? Like, you used to reference your data set on the C drive of your computer, and going to something in the cloud is kind of a
00:08:42
Speaker
different way to think and a different way to act. So I don't want to say problems. I think it is challenges, right? And maybe not even convincing people that it's a better way, but just helping them get to that better state.
00:08:55
Speaker
Yeah, no, that's a super good question. I'm gonna give you a boring answer, which is: there's a win-win mentality, right? I'm gonna go back to Stephen Covey's book, the seven habits, right? You go and say, hey, where's the win-win situation here? I'm not going to people by fiat and saying, hey, you have to switch from your personal computer running SAS to running R in the cloud now, right? But there have been projects. You know, we were working with Snagajob, this company that's the largest online app for connecting people with, you know, sort of hourly jobs.
00:09:24
Speaker
And I had never heard of it before, but they have tens of millions of people across the US. So, you know, I'm used to applying for jobs with a paper application, from when I was working my minimum-wage grocery store job, right? But they have this app now. And these researchers are sitting with Stata on their personal computers trying to merge a 72-million-record data set with a 24-million-record data set, just crashing their computer constantly and taking days and days. And then we come in and we're like, hey, let me show you how to do that in five minutes.
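[Editor's note: a minimal sketch of the kind of cloud-scale merge Graham describes, using PySpark. The S3 paths and the join key, user_id, are hypothetical, not the actual project's schema.]

```python
# A minimal sketch of a large two-table merge that would crash a laptop
# but runs comfortably on a Spark cluster. Paths and key are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("large-merge").getOrCreate()

# Each file can hold tens of millions of rows; Spark partitions the work
# across the cluster instead of loading everything into one machine's memory.
left = spark.read.csv("s3://example-bucket/applications.csv", header=True)
right = spark.read.csv("s3://example-bucket/postings.csv", header=True)

merged = left.join(right, on="user_id", how="inner")
merged.write.parquet("s3://example-bucket/merged/")
```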
00:09:52
Speaker
And then they're like, you know, whereas in a previous project where it was like, you know, I have 10,000 records here, 20,000 here, I can figure it out myself. But then now that there's this huge problem, it's like, okay, this is an easy win-win for me. And then same thing with like the zoning reform I mentioned, right? It's, you know, how do we collect this totally new data that we never had before? In that case, you're like, wow, I can do all this new research that no other researcher can do because I have a data science capacity, right? So it's like, you know, we're not pushing in every area and 80% of the research projects are still doing the same things that they're
00:10:22
Speaker
doing. But in areas where there's this real opportunity to innovate and do new things, it seems like an obvious win-win. And even then, you're right, there's still some, let's say, extreme skepticism, as we're trained to have in the research field. So that's also hard, and I don't want to minimize that.
00:10:41
Speaker
So maybe this is a good segue to the data privacy and security issues that Claire specializes in. Because you already mentioned these huge data sets; they may have specific, individualized observations. And it's not that Urban has really ignored those security issues, but
00:10:59
Speaker
they definitely change, and are maybe even more important, when we go from something like the ACS or an aggregate data set to something like your credit record. So, I don't know, Claire, I don't even know what the question is, to be honest. Maybe give a baseline, a framework, for people to start thinking about: when it comes to research, what does data privacy mean?
00:11:24
Speaker
Yeah, so data privacy is a huge umbrella, and the specific area that I specialize in is more the data-releasing side of data privacy. So the example I usually give people is health care data, because everybody, well, especially now, has it on their mind. Health care data is a very useful data set for finding early correlations on diseases, cancer, HIV, and now, with COVID, it would be really useful for figuring out
00:11:50
Speaker
maybe certain kinds of symptoms that are more common, or people who are more targeted or affected. However, at the individual level, there's a lot of very personal information in those health care data sets, and researchers who use them shouldn't know who in the data set has COVID or has cancer, HIV, and so on.
00:12:08
Speaker
But they still need to have access to it. So the kind of research I try to do is to make what we call a synthetic or pseudo record, a fake version of the data set that is statistically representative of the original but still useful for those medical researchers. And another caveat I forgot to mention: our taxpayer money goes toward collecting that data, so people should have access to it at some level.
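[Editor's note: a minimal sketch of one naive way to build a "synthetic" numeric data set: fit a multivariate normal to the original and sample fake records. Real synthetic-data methods are far more sophisticated; the column names and values here are made up.]

```python
# A naive synthetic-data sketch: only means and covariances are preserved.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Stand-in for a confidential data set (e.g., age and a lab value).
original = pd.DataFrame({
    "age": rng.normal(50, 12, size=1000),
    "biomarker": rng.normal(3.2, 0.8, size=1000),
})

mean = original.mean().to_numpy()
cov = original.cov().to_numpy()

# Sample synthetic records from the fitted distribution; no row corresponds
# to a real person, but aggregate statistics stay roughly representative.
synthetic = pd.DataFrame(
    rng.multivariate_normal(mean, cov, size=1000),
    columns=original.columns,
)
print(synthetic.describe())
```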
00:12:35
Speaker
And so here at Urban, like Graham was talking about, we have a lot of very large data sets. And nowadays it's becoming much harder to protect people, because we have these computers in our pockets. I talked a little bit about that in my blog: when the 2010 decennial US Census came out, the methods they used, which had existed for many years, basically worked. But nowadays it's a lot harder to just
00:13:02
Speaker
remove the PII, the personally identifiable information. One of the techniques the decennial census used was called swapping: they would randomly swap people's attributes with somebody else in another state. That's how they were protecting individuals. But in 2010, smartphones were barely a thing. Today's smartphones are stronger than the laptops or desktops we had in 2010, so it's really hard to protect against just brute-force computational power.
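[Editor's note: a minimal sketch of the swapping idea Claire describes. The records, the pairing rule, and the swapped attribute are invented for illustration; this is not the Census Bureau's actual procedure.]

```python
# Toy data standing in for confidential records.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

df = pd.DataFrame({
    "state": ["ID", "WY", "NM", "DC", "MT", "VA"],
    "income": [41000, 52000, 47000, 88000, 39000, 76000],
})

# Pick two records from different states and swap a sensitive attribute.
i, j = rng.choice(df.index, size=2, replace=False)
while df.loc[i, "state"] == df.loc[j, "state"]:
    i, j = rng.choice(df.index, size=2, replace=False)

df.loc[i, "income"], df.loc[j, "income"] = (
    df.loc[j, "income"],
    df.loc[i, "income"],
)
# National totals are unchanged, but individual rows no longer match
# real people, which is the point of the protection.
print(df)
```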
00:13:28
Speaker
And on top of that, now we have social media. Maybe people can be linked by those external data sets, and we can find them in the decennial or in our health care data, which happens quite often. And so, segueing into some of the links I provided: one of the kinds of data that's really hard to protect is spatial data. The two links are talking about cell phone metadata, because that location data is
00:13:56
Speaker
pretty accurate, right? We all carry smartphones, and that information is being collected. And in, I think it's the second link, the older one, from December of 2019, the New York Times showed how you could find out where people live and work just from how frequently, or at what time of day, they go between places. And one of the things they noticed, I might invert this, was somebody who worked at Amazon:
00:14:26
Speaker
it was obvious, and then at one point they visited Google, and two months later they switched jobs and started working there. So, something like that. And there was an article a couple years back, I believe from Stanford, about how they could figure out who had heart conditions, because those people would go visit a particular doctor. And that's kind of scary.
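[Editor's note: a minimal sketch of the inference logic in those articles: with timestamped location pings, the most common nighttime location is likely home, and the most common weekday-daytime location is likely work. The pings below are fabricated.]

```python
# Infer likely home and work locations from a handful of fake pings.
import pandas as pd

pings = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-12-02 02:10", "2019-12-02 03:40", "2019-12-02 10:15",
        "2019-12-02 14:30", "2019-12-03 01:55", "2019-12-03 11:05",
    ]),
    "cell": ["A", "A", "B", "B", "A", "B"],  # coarse location identifiers
})

hour = pings["timestamp"].dt.hour
home = pings.loc[(hour >= 22) | (hour < 6), "cell"].mode()[0]   # nighttime
work = pings.loc[(hour >= 9) & (hour < 17), "cell"].mode()[0]   # working hours
print(f"likely home: {home}, likely work: {work}")
```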
00:14:50
Speaker
So all this data, especially on the cell phones: are companies able to get these data because we all just clicked that little agree button on every app that pops up, and no one ever reads it?
00:15:02
Speaker
Most of them, yes. Because you'll see it's like, oh, are you okay sharing your location data? And a lot of people say, yeah, I totally want Google to know where I'm at. And actually, one of the most identifying pieces of information, next to your Social Security number, is your cell phone number,
00:15:20
Speaker
because people are very resistant to changing it. And so when people ask you, like, oh, would you like to join our membership program? If you go to Costco... I don't know, actually Costco does this as well; I was thinking of grocery stores. So when I was in DC, my video has slowed down, there we go, sorry. So in DC there's a Harris Teeter, a grocery store I went to, and they asked, oh, would you like to become a member? You just need your cell phone number. And so that's how they track you.
00:15:49
Speaker
Great, this is gonna be the most terrifying conversation we're gonna have, I think. I'm so sorry. So there is a question that was sent over, about the recent blog post on reductions in movement based on cell phone data, where DC got an A-plus and Wyoming got an F. They also want to know what level of aggregation was used, and what kinds of apps or sensors were used to measure all of that.
00:16:15
Speaker
That's a good question. I haven't been able to dig in deeper, because that was one of my questions: what was their baseline? When they're comparing people, was it to several months ago or not? They did talk about where it's from. What was the company? There's a startup company that's actually gathering the data. Yeah.
00:16:35
Speaker
And so that's what I'm very curious about and I tweeted about it because one of the things that bothered me when they were saying like DC got an A plus and Wyoming got an F was that I thought they ignored the social aspects of these different areas because like for me I came from Idaho and actually lived in Wyoming for a little bit because that's where my mother is right now and
00:16:59
Speaker
Unlike when I was in DC, where right off the Metro there was a Harris Teeter, a grocery store, and I would just go pick up what I needed, and it was pretty frequent, until the COVID
Addressing Privacy in Research
00:17:09
Speaker
happened, and I was like, oh, okay, I'll just go weekly. But in Wyoming, or the more rural states, you live so far away that, even without the COVID situation, you wouldn't be going to the store
00:17:21
Speaker
every other day. It's like once a week, or every two weeks. Or, when I was growing up, we made our monthly trip to Costco: you got the biggest car you could, sometimes you got a trailer, you went to Missoula, which is a three-hour drive away, you went to Costco, and you just loaded in like a thousand dollars' worth of stuff, and that was supposed to hold you over until the next time. Yeah.
00:17:51
Speaker
And, John, I'll add to that too. There's the urban-rural issue that Claire mentioned, but then there's also the fact that when you're analyzing these data, whether it's from that startup or other companies, you usually know the location and a unique user, but you don't usually know which app is the one recording it. And each app records it in different ways and at different intervals.
00:18:12
Speaker
So we'll just have all these locations from different apps on someone's phone. Maybe it's just one app, maybe it's 10 apps; you pressed yes 10 times, to every app you downloaded, right? And some people are super users: every minute, you know exactly where they are. And then some people, you know, it's like once a day,
00:18:27
Speaker
or once a week, because they didn't open that app all week. Right. So, yeah. Right, right, data quality issues there. Right. So can you two maybe talk about a project at Urban where this issue became really acute?
00:18:44
Speaker
Because I think about, as someone who's just done research for a long time with, like, the ACS or the CPS, a lot of the Census Bureau data, I haven't really had these particular issues, because the Census Bureau tries to address the security and privacy issues. Graham, as you've built out your team at Urban and introduced some of these other data sets, is there an example you can give people where
00:19:09
Speaker
some privacy and security issues came in that your more senior researchers may not really have thought hard about?
00:19:18
Speaker
Yeah, so I mean, there's a ton of examples here. And one of the reasons we brought on Claire was so we could get better at this, right? Yeah, right, right, right. I can't memorize all these problems all at once and figure that out. And so part of it is that Claire is going to be disseminating these practices in the future. We haven't decided when, because she just recently started a few months ago, and I don't want to commit her to exactly when that's going to happen, but she's interested in doing that soon. And that's going to help disseminate practices throughout Urban.
00:19:46
Speaker
Generally speaking, I think right now we're in the phase of trying to make people aware of the issue, and especially when it's related to de-identification. The most common use case where I think this can go wrong and where we're talking with a bunch of government agencies and also internal researchers about is whatever data collection effort you've gone through and you want to de-identify the data and release it publicly or release it to other researchers via an archive.
00:20:12
Speaker
It's like, what's the risk of that data being re-identified? In other words, you have all this really personal information that people might not want to have shared with other people. And now, like, that's just a random individual in the file. You don't know who that is. But then you use a few pieces of information like their zip code or, you know, their phone number, right? And all of a sudden you've re-identified the individual or their birth date or, you know.
00:20:37
Speaker
age, location, things like that are really sensitive. So the point is, how can we do that better? And I think Claire and I, and I'll let Claire talk in a second, have sort of talked with a bunch of people who are releasing data, both internally and externally. And one of those great examples is the decennial census. We're working directly with the Internal Revenue Service, with IRS, on how do we release their data in a way that's useful for researchers, because right now they release a public use file every year.
00:21:03
Speaker
And, you know, over the years, over the last few years, this public information, it's gotten less and less useful, right? Because they've had to create more and more draconian sort of privacy cuts to that data set. And so, you know, there's this issue around how do we best protect that data set in a way that works for the research users, but at the same time, protects everyone's individual privacy, because, you know, we don't want to share that information publicly with anybody.
00:21:32
Speaker
Is it my time to jump in? All right. It's kind of harder to tell on here; it's like, I wave my hand. So:
00:21:43
Speaker
one of the issues Graham was alluding to, what I would call one of the top five problems in data privacy, is accessibility. Even though the field has existed for decades, there has not been a lot of communication, and I guess the motivation became more prevalent once computers came along. When I say data privacy has existed for a while: it has existed more formally since the '60s or so, when the first, more technical papers were coming out,
00:22:11
Speaker
talking about what we should do with data like the decennial census, and things like that. So right now the field is very scattered. Some people say, oh, it originated in computer science; some people say statistics. But it's beyond just that: computer science, statistics, economics, social science, and the list goes on and on of
00:22:31
Speaker
those who are interested in the field, because they all either want to access data in some form, or have data they want to share with others. But the whole accessibility issue cuts both ways: there are those who specialize in trying to release these data sets, and those trying to figure out the latest techniques or ways to analyze data that's been more coarsely aggregated because of privacy issues. So there are techniques to try to get better estimates.
00:22:53
Speaker
However, they tend to be very technical. So the average user, anybody interested in trying to either apply them to their own data or use the data, can't quite decipher it all.
00:23:05
Speaker
And even with that, some people think, okay, well, there have to be some packages, because right now we have a lot of great open-source
Data Science Practices and Privacy Policies
00:23:13
Speaker
languages. Why don't we use those? There are a couple of packages, but they're either very specialized, or not well known or well used. So even on the computational front it's a little tricky. Or sometimes the packages make assumptions. Like one of the packages I found assumes you have a GPU, or access to a computer with a GPU, which
00:23:35
Speaker
the average person is not going to have, especially since the one they said they used was like three to five thousand dollars just for that component. That's very unrealistic, even for a government agency, because there's red tape everywhere when you're trying to say, oh, I want to buy this, and your supervisor asks why.
00:23:57
Speaker
It's definitely a big, big hurdle. And so something I'm trying to work on more actively is creating better communication materials about the latest techniques and how people can get access to them, or even just trying to decipher some of those technical papers a bit better for the average person.
00:24:14
Speaker
Can I reframe this, John, for a second? Do you mind? Yeah, of course. The real issue here is that, as Claire explains pretty well in her blog post, the usefulness of the data, for us and for others to use it to help inform policymakers and improve people's lives, and the privacy of the folks who don't want to be revealed are sort of directly at odds. And so what we're trying to do is ensure that
00:24:44
Speaker
we make the trade-offs in the areas where the data is super needed publicly, right, where we give up a little bit more privacy and maybe have the data be more useful, and, on the other hand,
00:24:55
Speaker
where we really actually don't need it, right, we can give privacy back to the individuals and make sure it's super secure. And one of the reasons I'm really excited to have Claire, why we added her to the team, and why we have so much work coming up with her, is that this is a policy discussion, not a computer science discussion. And where data privacy has lived is in the computer science field, right? They're talking about, here's how you protect privacy,
00:25:22
Speaker
protect privacy like this, and it's a really stringent standard for doing so, and we sort of give up all the usefulness. Now we need to actually have a conversation with both people in the room. Like: I want to protect the privacy. Great. Here's the usefulness of the data. Great. How do we have that conversation when the terms are things like epsilon, and no one really understands the mathematical definition of differential privacy, and what it means for my data set when you increase one versus the other?
00:25:50
Speaker
That's where I see a lot of these sort of academic, wonky debates popping up over the last year, with folks from IPUMS or the Census or others sort of saying: we're doing it the right way; no, you're doing it the wrong way; you should do it this way. And I think we want to have a much more productive discussion than that, because it's really a valuable one to have. It's like a food fight right now, to put it nicely, I guess.
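[Editor's note: for readers wondering what epsilon actually controls, here's a minimal sketch of the textbook Laplace mechanism applied to a count. Smaller epsilon means more noise, so more privacy and less accuracy. This is an illustration of the general idea, not the Census Bureau's implementation.]

```python
# The Laplace mechanism on a simple count query.
import numpy as np

rng = np.random.default_rng(seed=2)

def dp_count(true_count: int, epsilon: float) -> float:
    """Release a count with Laplace noise calibrated to sensitivity 1."""
    # Adding or removing one person changes a count by at most 1,
    # so the noise scale is 1 / epsilon.
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

true_count = 1280  # e.g., people in some hypothetical census block group
for eps in (0.1, 1.0, 10.0):
    print(eps, round(dp_count(true_count, eps), 1))
# Low epsilon: noisy but private. High epsilon: accurate but less private.
```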
00:26:20
Speaker
So the easy answer, well, not the easy answer, but the easy next stage, is: oh, let's get everybody in a room and talk about it and have these conversations. But that's the answer for everything: let's bring both parties together and have a conversation. So, looking beyond just having these conversations, what's the next step in this process of really getting the computer science and research sides together, beyond just "they need to talk it out"?
00:26:45
Speaker
Let me rephrase this question. If you two were in charge of all federal data and privacy in the United States, I guess my question is like, what is your ideal policy framework that you would put into place? Oh my gosh, that is a very good question.
00:27:05
Speaker
A lot of power. Oh my god, yes. Start with where you're starting now, and maybe we'll work up to that big answer. Yeah, I know. Actually, one of the first things I was doing while I was in DC, just starting the job, was to start all those conversations with people and see what their needs are: talking to the computer scientists, the Census, the IRS,
00:27:26
Speaker
the housing department, just trying to get a feel for what their data needs are and why they're trying to release data to other people, but then also talking with the data users, internally at Urban and externally as well: what are their needs? And because of that, I got inspired to do this data communication initiative. So currently I'm
00:27:47
Speaker
in the process of submitting a proposal to make a series of written and computational communication materials, targeted based on these meetings, and trying to bring people together and solicit feedback from them frequently. Because even though we gather great information from all these meetings, figuring out what people want, or what they want to release, is not always perfect; sometimes you make poor assumptions about what the data users want, or
00:28:13
Speaker
what the Census actually wants to release, or the IRS, or other government agencies. So it's about having that open dialogue. And hopefully this better communication among everybody will help release more data, because there are, frankly, a lot of great and valuable data sets not being released that could help inform policymakers.
00:28:35
Speaker
And so I guess the next step after that is getting more researchers to analyze that data, and then seeing how we can take these analyses up to the policymakers so they can make better decisions. And,
00:28:50
Speaker
I guess, trying to figure out the big, big, big picture, right, you said if we had control of everything: it's trying to get all the different government agencies to accept these better practices. That's actually harder than people think. I don't know how many people here have worked with or in government; it can be very challenging at times to get new practices adopted. So if I had that, I don't know, that stamp, or
00:29:20
Speaker
yeah, a magic wand, and be like, you all accept this now, right? Right. To your point about what the next step would be: okay, we could just get people together all day, but that's expensive and hard.
00:29:35
Speaker
Like, I think what we want to do, here's the state of the field now, right? There are probably 10 to 15 people, and Claire can name every single one of them off the top of her head, who have enough knowledge and expertise in this field to be able to bridge that gap and help those conversations happen, right? And that's one of the reasons we hired her. It was a long hiring process, and I think, you know, Stass, who just asked a question, helped us with that in referring us to Claire. So thank you, Stass.
00:30:04
Speaker
And what I think is such an important thing is to recognize that that's such a small field. To make the data work, we're not only hearing from federal policymakers or from researchers here at Urban; we've been talking to local cities who say: we want to keep our city accountable by releasing local data, but we're worried about releasing it. Because, you know, we're a city that made an equity promise; we want to show that we're making progress toward hiring more
00:30:31
Speaker
people of color, right? But releasing that data is sensitive, right? So how do you dig into that? They're worried about it. So there are a ton more people in this world who need to be able to broker these conversations than the 10 or 15 people like Claire. So I think our first goal is: how do we get that second group the information they need to quickly understand the problems and trade-offs, and be able to make those conversations happen much more broadly? Or,
00:30:56
Speaker
how do we get the government analysts, or the new chief data officers, there's one in every federal agency right now, and chief data officers are popping up at the city level, how do we at least get them to understand the trade-offs? So, as Claire was saying, how do we produce communications materials? We're partnering with folks from, like, the Future of Privacy Forum and others who are real experts in this field and talk to policymakers all the time. How do we get them educated on how to talk about these trade-offs, right?
00:31:23
Speaker
And so I think producing these materials, producing some of these programming packages, open source, like in R or Python, you can imagine a playbook or a quick checklist, things like that, would help us get that next level involved.
Educating Government on Data Privacy
00:31:37
Speaker
We're not going to get everyone involved, but hopefully expand that field beyond just a few people that Claire knows. Right.
00:31:43
Speaker
Do you think it's more important, or maybe it's just equally important, that this comes from the analyst level up, or from the head of an agency down? I mean, it is in some ways easy to say from the top down, because
00:32:00
Speaker
an agency head says, we're going to be doing this, and then that's the rule now. But I wonder if that's actually not how the world works. And so, you've already talked about all the different people you're working with: when you look at the landscape of people out there doing this sort of work that it's important for, do you think about the analyst level first? Or do you think about the CDO or the agency head? Or is it just everybody on the ladder?
00:32:31
Speaker
Yeah. And Claire, I'll let you take it; let me give the super-high-level answer, which is: the high-level people are like, I'm super worried about privacy, I want to make this trade-off in a responsible way, I want to keep this data out there, how do I do it? Right? The analyst is like, man, this is way over my pay grade; I'm not a privacy researcher.
00:32:50
Speaker
And we want to help the analysts have the tools, and help the chief data officer have the one- or two-pager: here's what you need to do, here's how you say yes, here's how you empower your analysts. But, you know, different audiences, different products, right? I think they each have their own specific problems. Claire has talked to more of them, though, so I'll let her.
00:33:11
Speaker
I was going to say, it's definitely throughout that ladder. I'm just going to give an example. I was working at the Census Bureau early on, when they were thinking about switching over to differential privacy, and it did come from higher up, saying, this is what we're going to do. And when I was working in the Center for Survey Research and Methodology, a lot of people were just like: what is epsilon? I have no idea. They were all very confused, and
00:33:39
Speaker
they brought in these big-name researchers to give these talks. And I just remember my supervisor; we kept staring at each other during this one talk that was supposed to help educate the group on differential privacy, and they were all confused. He had this look of WTF
00:33:59
Speaker
on his face. And so afterwards he looks at me, we got a private chat, and he's like, Claire, you saw my face, right? Do you think I understood what was going on? I'm like, no, I don't think so. And he's like, okay, you need to do better, because I'm going to make you give a talk on this.
00:34:20
Speaker
So anyway, that's just my story about how we do need to think about everyone in between. Because if you don't have everybody at, I guess, the ground floor, the people working on the data and analyzing it, on board, then it's going to be much harder to move the whole organization, right? Right, right. Yeah, that's great. So we have a question from Gricia.
00:34:46
Speaker
Thank you. Hi. Thanks so much for doing this. My question relates to governments, specifically local city governments. Do you think they would be more likely to adopt these kinds of best practices, or, Graham, I think you mentioned a playbook kind of thing, if more of the big players started adopting them? So, for example, you somehow convince the US Census or the IRS, or the
00:35:13
Speaker
big city of New York or city of LA, to adopt them, and then other people would kind of feel the pressure, be more willing to try it out, since it's kind of been standardized by these big players adopting it. That's such a good question. I love that question. So yes, I definitely think you're right. I think right now the fact of the matter is, everyone's looking at the Census,
00:35:39
Speaker
right? So, for better or worse, that's our first big-picture example of what's going to happen, and I think people are sort of holding their breath and waiting. Claire and I, and Rob Santos, who's our chief methodologist at Urban, have tons of opinions about what the Census has done well and, to say it diplomatically, what they could have done better with this differentially private release. And I think people are waiting to see: what are those lessons learned, and how might I apply them as a local government?
00:36:07
Speaker
I also think, frankly, when we've been talking to local governments, and that's not just your LA or New York; folks like Austin or Kansas City are also interested in this stuff, from my conversations with them, they're really interested at the level of: give me a specific tool that I can apply. So whereas
00:36:26
Speaker
your folks at the IRS or HUD or the Census might be much more willing to say, well, my data is super special, maybe we have a little bit of budget to work through these issues of privacy and security versus usefulness, a Kansas City or someone might be, as you said, the next level after that: once you've seen a few models, you say, oh, well, this one is the right one for me, because they use this type of data.
00:36:50
Speaker
So I definitely agree with you that it's going to like sort of cascade or waterfall down but census right now is at the top of that waterfall metaphorically and we're all sort of just waiting to see what happens.
00:37:05
Speaker
The only thing I'm going to add to that is that I realized we haven't talked about the fact that a lot of places, from the federal level all the way down to local government, actually don't know about the privacy issues either. There are some who think: I just removed the personally identifiable information, and that's sufficient. But again, we have so much extra data out there, social media,
00:37:26
Speaker
and just what I call innocent pieces of information that people don't think about, which can be used against them, because they carry enough information to link them to a data set. And if people want a reference, I think the classic example, this is from 2008, is the Netflix Prize,
00:37:46
Speaker
where, yeah, Netflix released some of their data. For those who don't know, they were going to give $1 million to researchers who could improve the recommendation system; I think 10% was the threshold. And one research group, instead of trying to improve the recommendation system, tried to see if they could identify the people in the data set. And they could, because of, I always say this wrong, the IMDb
00:38:09
Speaker
database. Yeah. So they were just able to use that and find people. And this was back in 2008, so imagine what we can do now, with much more powerful computers and more information out there, because social media was only just emerging then. That was when MySpace was still a thing.
00:38:29
Speaker
Imagine what we can do now. And with the Netflix Prize data set, some people are like, I don't care if people know what I like. Claire, connect the dots on why social media is a good source of linking data. Is that valuable? Yeah. Claire, can you describe that? Why is social media a problem when you're talking about re-identifying?
00:38:49
Speaker
Oh yeah, so to expand a bit: Facebook is a really good one, because a lot of people will put down, oh, I was an alum of this university or high school, this is the year I graduated. A lot of people actually have their date of birth on Facebook. And so you can just link those to other data sets that
00:39:08
Speaker
maybe don't have the year of your birth but have the month and day, because they thought, oh, it's fine, nobody needs to know. Or say there's an education data set that's tracking people through time, and on your Facebook feed you said, oh, I graduated from this place. Or, say, a workforce data set: some people put down on their profiles,
00:39:30
Speaker
again, I'm picking on Facebook because it's a really easy one, oh, this is the first place I worked, and I now work at this other location. Or there was the case of a woman from Harvard who was able to link people posting about whether they'd had an accident or something like that, oh, so-and-so went to the hospital, and you can link them to the health care data set.
00:39:50
Speaker
And then, depending on the social media platform, people don't realize, you just take a picture of, like, your pet in your home, which is a very popular social media activity, as we all know. Why else do you go on social media, anyway? Other than, I guess, to hear your relative rant about politics.
Historical Breaches and Cloud Security
00:40:03
Speaker
But other than that, so you just take a picture, and a lot of our phones geotag our images by default, right? In the image metadata of most of our images is a geolocation. So now I know
00:40:16
Speaker
the block or zip code in which you live, which is super valuable for linking you across data sets, because your name may not be unique across the United States, but it's probably pretty unique in your zip code, right? Yeah, I'm poking around my office; I'm in the middle of a book and I can't find it right now. It's on algorithms, and the introductory chapter is about Boston releasing some health data and saying, oh, it's secure. And the reporter was like, well, let's see. And I think it was,
00:40:46
Speaker
I think whoever released it, it might have been the governor, or the mayor, said it was completely secure, and the reporter was able to track his specific address and location from the health record.
00:40:58
Speaker
So it was actually Latanya Sweeney; she's now a full professor at Harvard. She did this as her graduate student project; she was at MIT at the time. And so, yeah, the governor of Massachusetts was like, we're going to release all this data, and it is useful; it was the federal, or excuse me, not federal, the statewide employee data set. And he was like, we removed all the personally identifiable information,
00:41:25
Speaker
it's perfectly fine. And she record-linked it to voter data, and sent an envelope with all his personal health care records directly to his office. I was like, wow.
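[Editor's note: a minimal sketch of the linkage attack Claire describes: joining a "de-identified" health file to a public voter file on zip code, birth date, and sex. All records below are fabricated.]

```python
# Re-identification by record linkage on quasi-identifiers.
import pandas as pd

health = pd.DataFrame({
    "zip": ["02138", "02139"],
    "birth_date": ["1945-07-31", "1960-01-15"],
    "sex": ["M", "F"],
    "diagnosis": ["cardiac", "asthma"],   # "anonymous" sensitive attribute
})
voters = pd.DataFrame({
    "zip": ["02138", "02139"],
    "birth_date": ["1945-07-31", "1960-01-15"],
    "sex": ["M", "F"],
    "name": ["John Q. Public", "Jane Doe"],  # public, identified records
})

# If the (zip, birth date, sex) combination is unique, the "anonymous"
# health record now has a name attached.
linked = health.merge(voters, on=["zip", "birth_date", "sex"])
print(linked[["name", "diagnosis"]])
```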
00:41:42
Speaker
That's nicely done, yeah. So I have a few questions in the chat box. There's one, I think it's for both of you; I'll just read it quickly. Do either of you have recommendations for secure cloud storage to use if there's PII in a data set? Elizabeth wrote in: she has a small organization, they don't have an internal server, so they rely on cloud storage. I don't know if either of you have thoughts on that.
00:42:08
Speaker
So we do, actually. I built up a lot of our cloud infrastructure along with our DevOps team here at Urban. We have an IT department of over 30 people, so we're in a privileged position; I definitely recognize that. There's nothing inherently insecure about the cloud, but you do have to be really careful about how you architect the systems. So we have protocols around
00:42:30
Speaker
checks and security logs and things like that, which you need to have in place just like you would in your on-premise environment. And then you need to understand how data are being transferred between your organization and the cloud. Because a lot of organizations have virtual private clouds connected directly to their organization's network; in AWS speak, it's called Direct Connect, right? You have a direct line that goes from you where no other internet traffic is going over that line.
00:42:58
Speaker
And if you are transferring data that's not going over that line, you need to ensure it's encrypted in transit and encrypted at rest, and most of the modern tools will do that. But what you really want to ensure is that you have a system that's built right. And you may need help with that; we actually do this, not that it's a business line for us, but we have
00:43:14
Speaker
peer or friend organizations that we're nice with, where me and our DevOps team will just consult with them for a few hours to say: yeah, no, that's not the way you should set it up; you should set it up this way. Just to be nice. It's not Urban's business model or anything; we just want to see people succeed. But you do want to, at a minimum, make sure you have encryption at rest and encryption in transit.
Public Attitudes Toward Privacy Post-Pandemic
00:43:34
Speaker
And if there's any data that's governed by
00:43:36
Speaker
a federal law, such as HIPAA or FERPA or any of these health, education, or other privacy laws, you need to ensure you're storing it in a system that is built for that, and that any data transferred to and from that system also follows those regulations. The good news is, a lot of the cloud providers do have that by default. You can access a list for any cloud provider; just Google "AWS HIPAA certified services," right? Or "Azure HIPAA FERPA certified services." They'll give you a full list of everything that's certified and available to use.
00:44:03
Speaker
So that's a real benefit; it's sort of out of the box, and it's better than on-premise because you can just build it out. But generally speaking, you do want to have somebody, and we have five certified solutions architects on staff, you probably want at least one person who knows the cloud really well and knows how you're working with data at your organization, to at least double-check that the system is secure, whether you get a consultant for a few hours or some other method. Right.
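[Editor's note: a minimal sketch of "encrypted in transit and at rest" for a small team without its own IT department: encrypt a file client-side before upload, and request server-side encryption on the object too. The bucket and file names are hypothetical, and a real setup should manage keys with a service like AWS KMS rather than a key generated in a script.]

```python
# Client-side encryption with Fernet, plus server-side encryption on S3.
import boto3
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # store this securely (e.g., a secrets manager)
fernet = Fernet(key)

with open("sensitive.csv", "rb") as f:        # hypothetical local file
    ciphertext = fernet.encrypt(f.read())

s3 = boto3.client("s3")   # boto3 talks to S3 over HTTPS: encrypted in transit
s3.put_object(
    Bucket="example-secure-bucket",           # hypothetical bucket
    Key="data/sensitive.csv.enc",
    Body=ciphertext,
    ServerSideEncryption="AES256",            # encrypted at rest server-side
)
```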
00:44:28
Speaker
Great, I hope that was helpful. So, a couple other questions. This one's from Daniel; I like this question. The question is: do you expect the public's tolerance for privacy to change, given the recent COVID-19 pandemic? So after 9/11 we had the Patriot Act; at the time, in the immediate aftermath, we seemed willing to give up some of our privacy for security. And that has kind of shifted back and forth since then,
00:44:58
Speaker
depending on who's trying to sell the message. But it's a good question: do you think people's tolerance for privacy is going to change, given the current pandemic? As an example, this isn't in the question, but as an example I saw, DC is publishing the gender and age of everyone identified as infected with COVID. So I just wonder whether you think things will change in one direction or another once we're through this moment in time.
00:45:27
Speaker
Claire, do you want to take it first? Yeah, I'll try to take a stab at it; it may not be a great answer, I think. The example I think of most prominently is when the Facebook Cambridge Analytica story came out: people were very upset about it, right? And then they were trying to make some changes. But obviously things didn't change a whole lot, other than that now people can reference it and say, this is an issue.
00:45:52
Speaker
I can see it going both ways. I can see people relaxing, because, for instance, that metadata is really useful for figuring out not just a COVID pandemic but also other emergency responses, such as if we get hit by hurricanes a lot, too:
00:46:07
Speaker
trying to figure out where we can locate people, where the choke points in our road systems are to get people out, and things like that. But then I can also see it the other way. What if all this data being released, you said the gender and age of people who are being infected, gets used in very targeted ways? Like insurance companies saying: hey, you had COVID before, so your lungs, your respiratory system, are now maybe more compromised, so we're going to raise your rates. Right?
00:46:34
Speaker
Would that be fair? No, but that's what insurance companies will do. There have been cases where, if they know that you smoke, not only does your health insurance go up, your car insurance goes up, because they think you're more likely to pass away.
00:46:47
Speaker
So I guess that wasn't a very precise answer one way or the other, but those are some of the implications. And I'll just add one thing. I was in this sort of closed-door session, I can't name the funder or the organization, where they're doing market research with average, everyday Americans on data usage and data privacy. And one concept was: how can we better
00:47:11
Speaker
market integrated data systems, right? Like merging your health care data with your criminal justice data to do good in the world, right? Because there are a lot of systems around the country trying to merge different data sets across fields to, for example, help homeless people better access multiple services, instead of each provider treating them as if they've never seen them before.
00:47:29
Speaker
Right. So, you know, that is super valuable. And the summary I took away from that conversation is: people are okay with it if you tell them exactly how it's being used, and if they're okay with that use of the data. So, to Claire's point, an insurance company could go use this, and people's default view is, on average, going to be: well, someone's going to do something bad with this data, so it's not going to be good for me. And I can't blame them; I'm probably on the same page with them.
00:47:58
Speaker
When you tell them, hey, this is going to be used so we can better teach the kids in your school how to do X, or better support the kids who are having trouble at home, or we can help the firefighters better prioritize where to go, or we can help your local...
00:48:15
Speaker
those sorts of cases where there's a clear good and we're using the data to do that specific thing: that's what resonated most with people, right? There was actually something concrete and good that was going to happen with the data. And I think if we have some protections and ideas around that, that would help us in the long run. But I am a little bit on Claire's side; I'm a bit pessimistic. If we do relax it, I do think there will be a lot of those struggles, like with the insurance companies. Yeah.
00:48:44
Speaker
Actually, there was a paper I read, I think, where they surveyed a bunch of people about what they thought about privacy. And the general consensus was that people were okay, like Graham said, as long as you told them what the data was going to be used for, and whoever collected the data did the best they could
00:49:03
Speaker
to protect it and was open about the process by which they did so. Because one of the theoretical questions was: what if the data got broken into? And people were more okay with a break-in if the data had been password-protected, encrypted, and all these other things. They're like, oh, it happens; they tried their best, and it was for a greater good, or something like that.
Equity and Legislative Impact on Data Privacy
00:49:25
Speaker
Right. We're almost out of time here, but Sarah has a question that I thought was really interesting. So yeah, she unmuted. So Sarah, go ahead. Yeah, this kind of picks up on what Graham was just saying a bit. But I was wondering if there's work that folks have done on sort of the equity dynamics of re-identification risk, like are different groups more or less vulnerable to being re-identified? And then given that, are different groups more likely to see harm or benefit from re-identification?
00:49:55
Speaker
Claire, I don't know of any that have taken an explicit equity focus, but that could just be me not knowing the literature as well. Do you know of any? Not specifically, other than what we call the small-population problem. That's another one: when I said top five challenges of data privacy, one of them is the small-population problem. Because, on one hand, you need that finer-grained, detailed data to have more targeted benefits. For instance, I'm writing a proposal with somebody in our Metro center
00:50:24
Speaker
on trying to figure out: can we get access to employee and employer data to be better about helping rural communities do startups and help their small businesses flourish? However, at such a fine-grained level of geography, and with certain demographic information, it's going to be very identifying. For example, I come from Salmon, Idaho; I talk about that in my blog. There are 3,000 people there, and I was the only Asian-American high schooler,
00:50:54
Speaker
so some people could definitely find me there. So unfortunately, not to my knowledge are there any papers looking into the equity piece. It's a very important issue. I like to tell people that there are so many cool and interesting problems we really need to address and work on, but there are not enough Claires.
00:51:17
Speaker
It's a really open space. And why we haven't seen any action on it is also a good question. I think we should see more research in this field. Part of the reason we haven't, and maybe the Census might be the first to do this, is,
00:51:33
Speaker
as you said, Claire, there's the small-population problem, and then there's the non-response, or are-they-even-included-in-the-data, problem. In some ways we over-surveil people in criminal justice data, but in other data sets we often have to correct for or oversample lower-income people, or people of color, or Latino people, because they aren't responding to
00:51:51
Speaker
the surveys at as high a rate, and that creates the exact same problem Claire was mentioning in terms of how you assess representative risk. One of the things we were discussing with the city of LA a couple months ago is a spatial equity representation tool. Essentially, it's a tool we're building where governments give us point data on where they're investing, and we can say, well, you're over-investing in higher-income white neighborhoods, right?
00:52:19
Speaker
And we use underlying census data to do that analysis. And the city of LA is like, yeah, that's a great tool, and we can see how we really need it. On the other hand, we have a huge census non-response rate, one of the highest in the country. So why do I even trust your equity analysis? There are so many people who aren't included in the data we treat as ground truth today that I'm questioning anything you do with that data. And I think that's an interesting, larger question to consider.
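As a rough illustration of the kind of analysis Graham describes, and not Urban's actual tool, here is a sketch with hypothetical column names and toy numbers: aggregate investment points by census tract, merge in tract demographics, and compare per-capita spending across income levels.

```python
import pandas as pd

# Hypothetical investment points, already geocoded to census tracts.
investments = pd.DataFrame({
    "tract":  ["A", "A", "B", "C"],
    "amount": [500_000, 250_000, 100_000, 50_000],
})

# Hypothetical tract demographics drawn from census data.
tracts = pd.DataFrame({
    "tract":         ["A", "B", "C"],
    "population":    [4_000, 6_000, 5_000],
    "median_income": [95_000, 42_000, 38_000],
})

# Aggregate spending by tract and compute per-capita investment.
spending = investments.groupby("tract", as_index=False)["amount"].sum()
merged = spending.merge(tracts, on="tract")
merged["per_capita"] = merged["amount"] / merged["population"]

# Compare per-capita spending in higher- vs. lower-income tracts.
high = merged["median_income"] > merged["median_income"].median()
print("high-income tracts:", merged.loc[high, "per_capita"].mean())
print("low-income tracts: ", merged.loc[~high, "per_capita"].mean())
```

Graham's caveat applies directly here: the population denominators come from census data, so differential non-response quietly biases the equity analysis itself.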
00:52:46
Speaker
Yeah, I don't know if I'd call it research, but in Cathy O'Neil's book, Weapons of Math Destruction, she has a lot of examples of these. Though I don't know that I'd call that research per se. I mean, she did research to write the book, but probably not in the sense we were just talking about.
00:53:02
Speaker
So I think we have just a couple more minutes, and I want to let Daniel come back in on this question about whether people's perspective on data security and privacy will change. He had a follow-up question. So Daniel, go ahead. Hello, can you hear me?
00:53:19
Speaker
Yes. Okay, great. Hey, Graham. Yeah, so I wanted to push back a little bit on your comment, Claire, because while the surveys are nice, I haven't really seen much legislative or popular pushback. And I was curious whether you thought the light punishment for the Equifax hack, or the limited repercussions for Facebook after the Cambridge Analytica scandal, sort of
00:53:47
Speaker
reinforces the idea that the public perhaps doesn't care, or doesn't understand enough about privacy to do something about it. Do you have examples of companies being properly punished, or of organizations that are really promoting legislative change?
00:54:08
Speaker
So that's a good question. This is a little bit outside of my expertise; it's more that I'm aware of these things because they are tangentially related to my research. In terms of companies being directly punished, I can't think of any off the top of my head, but I know that, given things like Facebook and
00:54:27
Speaker
what happened there, there were other companies who responded in their own practices, either because they were afraid users wouldn't trust them anymore and would stop using their products, or because
00:54:43
Speaker
maybe, and I shouldn't say it like that, but some of them do care and try to be a good corporate citizen, or something like that. For instance, right after the Cambridge Analytica scandal happened, and that incident happened two days before I defended my thesis,
00:55:00
Speaker
so that was very interesting and timely. When I went out on the job market, there were actually a lot more positions posted for data privacy specialists. And I think that was in response to companies saying, hey, we don't want this to happen at our company, we don't want to be under the scrutiny of Congress, we need
00:55:19
Speaker
to respond and be better citizens. And then on the policy side: if you're interested in that area, you should look at the Future of Privacy Forum. They're a non-profit that very specifically talks with members of Congress about enacting better privacy laws. They were also following Washington State closely when it was working on its privacy legislation. Unfortunately, that fell through, so they're going to push again to try to get better privacy protections.
00:55:49
Speaker
I'm trying to wrap up, given the time, but there are other privacy issues, such as when you go to shop somewhere and they ask, oh, do you want to join our newsletter? That's actually a privacy risk, because they are indirectly pressuring you to be part of their system. So yeah. Interesting. Yeah, so
00:56:07
Speaker
Wow. And I will just make one comment, John: GDPR has actually forced a lot of innovation in this area.
Conclusion and Future Discussions
00:56:15
Speaker
I hate to say that because people are really anti-GDPR most of the time, but I think it has, and companies are basically
00:56:22
Speaker
trying to do this; they're essentially building these systems now. Social Science One is an example: Facebook put out a bunch of data on Facebook shares, and it took them two years to build a private system that researchers can access. If you look at their public blog about it, their lawyers almost didn't release the data because they were so worried about GDPR fines and penalties. So I think there really has been a shift, as Claire said, in some of these private companies, like LinkedIn and others, toward really better protecting people's data.
00:56:51
Speaker
Yeah. We started this hour by saying, what are we going to talk about? I don't know, we'll just kind of wing it and figure it out. And we just hit the hour mark, and there's more we could talk about, I'm sure. I'm sorry to the folks who sent in questions we didn't get to. I'm sure everybody has another Zoom call to get to, so I'll just say thanks to everyone for tuning in today, and a special thanks to Graham and Claire for chatting.
00:57:17
Speaker
Really cool. I put up the link to the rest of this week's lineup. Tomorrow, two more Urban colleagues of mine, Rob Santos (who I see is on this call; hi, Rob) and Diana Elliott, will be talking about their project from last year on the 2020 Census and what COVID will mean for the counts going on right now. And then more great folks are coming up the rest of the week. So thanks, everyone, for tuning in. I appreciate it.
00:57:45
Speaker
Keep in touch, and let me know who else you'd like to hear from on these chats. So thanks, everybody. Have a good one. Stay safe, stay healthy, take care.
00:58:00
Speaker
And thanks to everyone for tuning into this week's episode. I hope you enjoyed it and found it interesting. I'm going to post more of those digital discussions in the coming weeks. If you're interested in hearing, or even seeing, more of them, you can head over to the Urban events page, where we've posted recordings of nearly all of those talks; from there you can link over to the Urban YouTube
00:58:30
Speaker
channel. I hope everyone is well and safe and healthy. Until next time, this has been the Policy Viz Podcast. Thanks so much for listening.