Introduction to PolicyViz Podcast
00:00:11
Speaker
Welcome back to the PolicyViz podcast. I'm your host, Jon Schwabish. I am in lovely Southeast DC today. It's like 81 degrees, and it's a gorgeous day. I'm at the Department of Transportation with Dan Morgan, who's the chief data officer here at DOT. Dan, thanks for coming on the show. Thanks for having me. This is going to be exciting. We're also using the new podcasting gear, which is also a lot of fun. It's fancy. It's fancy. We'll see if it actually works. You're going to have to post a picture.
00:00:38
Speaker
Yeah, I'll have to post a picture.
Pushing Open Data at DOT
00:00:40
Speaker
So we met a few weeks ago, first time at the Socrata Summit, where you were talking with other chief data officers about a whole slew of open data efforts in the federal government. Why don't we start by having you talk a little bit about the open data efforts that you're going through or trying to push through here at DOT? Sure. So one thing people always ask me is what kind of open data we have at the department, right?
00:01:03
Speaker
Something that's important to understand is what our mission is. It's a little bit different than maybe your local city or state DOT. So we operate the national airspace, and so the Federal Aviation Administration is part of us, right? And that's the reason you can track a plane moving across the sky and see when your loved ones are going to land at the airport, right? That's all open data that they've been providing for years.
Public Access to Safety Data
00:01:28
Speaker
And we're continuing to look for ways to add value to that data and make it more useful
00:01:33
Speaker
as they go through some IT upgrades and pursue upgrades to the air traffic system. We're also a regulator, so people will probably find our regulations in their day-to-day lives. When you hear a car commercial
00:01:49
Speaker
for any vehicle where they talk about their five-star safety rating, that's our data. If you have a problem with your car and you need to complain about it, we get complaint data from you as well, which we release to the public. And we were doing it before the CFPB was cool.
00:02:07
Speaker
And we also have all of the recall information because, of course, for consumer protection, one of the most important things is to make sure that that information is available as broadly as possible. So we've been working on making that data available through APIs so that folks can find recall information basically anywhere they already exist.
00:02:27
Speaker
There's whole companies out there, actually, that build recall sites, companies like We Make It Safer and a couple of others that I know about that have talked to us about how they're using government data. And of course recalls happen from all around government. So we've been actually collaborating with sister agencies like the Consumer Product Safety Commission and the FDA
00:02:47
Speaker
and the Food Safety and Inspection Service. You would be surprised how many consumer protection missions live around government.
Complex Datasets and Analysis
00:02:55
Speaker
Making that data available programmatically is really key to making sure people understand what's happening and they can take action to keep their families safe.
00:03:03
Speaker
So a lot of the data you just talked about was data that I think everyday people want to use. Like you said, track your family as they're flying across the country or learn about whether your broccoli is safe to eat. What's the slate of data sets that maybe are a little more in depth, require a little more work to sort of parse through and actually do analysis with? What does that ecosystem look like?
00:03:28
Speaker
Oh, yeah, there's some really nerdy policy stuff. When you pay your gas tax, that gas tax eventually flows into the Highway Trust Fund, from which we pass out about $40 billion a year all across the country to the states to help maintain a safe and efficient roadway system. We can debate how well that's working, but we've got data about the condition of roads all around the country.
00:03:57
Speaker
You kind of need to be a civil engineer to approach the Highway Performance Monitoring System, which covers all the millions of miles of road in the country in segments and tells you things about how rough the pavement is and those kinds of things. We've got the National Bridge Inventory, which is about 250 data elements wide
State-Specific Data for Congress
00:04:19
Speaker
and getting bigger because we're going to be adding details, not just about the bridge overall, but about the deck and the substructure and the superstructure and whether there are bicycle and pedestrian facilities on the bridge. So we got all these data points. There's 700,000 bridges across the country. When you hear the president in the State of the Union say how many bridges are eligible for Medicare, that's our data. That's your data, right.
00:04:45
Speaker
So, we've got that kind of stuff too. It's really wonky, slow policy data, right? It comes out every year. And building like a time series about that is kind of difficult. So, that state of the infrastructure kind of data is in the agency. We also collect data about safety incidents. We collect it for aviation.
00:05:10
Speaker
rail, transit, fatal crashes, not all crashes. Ask me about that later. Okay. Hazardous material spills in transportation, not all hazmat spills, and pipeline incidents, which may or may not result in the leaking of the chemical.
00:05:28
Speaker
Pipeline is transportation too. I think sometimes transportation of goods. So the analysts who work here, they are... Well, let me ask this. What are the sorts of products or reports or outputs that they're working on, and who are they trying to communicate with?
00:05:47
Speaker
So, of course, we're trying to communicate to the public about how well we're doing. So, there's lots of aggregate reports. So, we tend to fall into a sort of rhythm of if the data moves annually, we'll report an annual sort of number of fatal crashes or a rate of fatal crashes and that requires two data sets to exist. One, how much driving there is and the other one, how many fatal crashes there were.
00:06:11
Speaker
We usually set a performance goal at a rate as opposed to just a raw number because things are dynamic. Those are generally pretty static products and simple line charts. That'll give you a feel for what the overall trend nationwide is, but it turns out that transportation is a really local phenomenon. It's sometimes difficult to get into our data and build that community picture.
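The rate-versus-raw-number point can be made concrete with a minimal sketch. It assumes the common convention of expressing the rate per 100 million vehicle miles traveled; the function name and every figure below are made up for illustration and are not DOT data.

```python
def fatal_crash_rate(fatal_crashes: int, vehicle_miles_traveled: float) -> float:
    """Fatal crashes per 100 million vehicle miles traveled (assumed convention)."""
    if vehicle_miles_traveled <= 0:
        raise ValueError("vehicle_miles_traveled must be positive")
    return fatal_crashes / vehicle_miles_traveled * 100_000_000


# Hypothetical inputs: the two data sets the rate depends on.
years = {
    "year_a": {"fatal_crashes": 30_000, "vmt": 3.0e12},
    "year_b": {"fatal_crashes": 31_500, "vmt": 3.5e12},
}

rates = {
    year: fatal_crash_rate(d["fatal_crashes"], d["vmt"])
    for year, d in years.items()
}

for year, rate in rates.items():
    print(f"{year}: {rate:.2f} fatal crashes per 100M VMT")
```

In this invented example the raw count goes up between the two years while the rate goes down, because driving grew faster than crashes. That is the kind of difference a raw number alone would hide.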
00:06:39
Speaker
And we don't really design a lot of individual communications. The other people that we're really trying to inform are Congress because they're the ones that are designing how much we get to spend on making the transportation system safe and efficient for all of you. And so there are a lot of state-based products.
Transportation Data Relevance Challenge
00:06:57
Speaker
We'll tally
00:06:59
Speaker
all the statistics we possibly can about a state. There's actually a product called State Facts and Figures, which is a synthesis of all of those different kinds of data sets that I talked about. How many people are registered with a driver's license, to how many vehicles are registered in that state, to the number of safety incidents that occurred, to the miles of road and different kinds of condition, to the miles of pipeline and the miles of railroad, et cetera, et cetera.
00:07:26
Speaker
Um, and so those are sort of good thumbnail products. And so, you know, usually a Senator or a delegation in the House can look and see how things are going in their communities. But of course they're hearing from their constituents. Yeah. Right. They're currently back home hearing from their constituents right now.
00:07:43
Speaker
So we generally build these sort of report card kinds of things that move at the speed of the data that comes to us. And their perspective is sort of annual or quarterly or monthly, which has its utility, but it's not always enough. So if you look at your travel patterns, sometimes there's a lot of really short trips and sometimes there are long trips.
00:08:11
Speaker
And you don't care about necessarily that one slice of transportation safety or transportation infrastructure information just for yourself.
Improving Data Accessibility with APIs
00:08:23
Speaker
You care about all of those things at once because that's how you live.
00:08:27
Speaker
So we don't really create sort of that person-based sort of exploration of our data sets, right? To sort of build that picture, that integrated picture, we call it a multimodal picture of transportation facts and figures from your point of view.
00:08:47
Speaker
More from the policy makers. Correct. Or what the policy maker would want. Right. So how do you go about building that ecosystem, I guess, of data where you can pull all this together to get that person or place-based perspective? Well, I've been at it for three years, and I'm not done.
00:09:07
Speaker
I don't know if I'm doing a great job. Some of it is really about unlocking the data. For every dataset I just named, there is a different way that we release that dataset, a different time frame, a different format, and a different website. It's making it easy for everybody. It's super easy to find all of our stuff. Some of it for me is either pushing
00:09:33
Speaker
API access down to the source as best as possible. And we've made some small strides in that. For those of you who care about freight, there are now APIs from the Federal Motor Carrier Safety Administration that'll give you a census of trucking companies, what levels of insurance they have, what their safety records are, and those kinds of things. The National Highway Traffic Safety Administration, the people who make the five-star safety ratings, is working on APIs as well.
00:10:02
Speaker
Federal Railroad Administration has been tinkering with APIs and we're looking at our Socrata investment as a way to sort of accelerate that availability. And then bringing new tools into the department. So we've made a couple of investments in some visualization software like Tableau and a couple of others like it. And I should probably say somewhere in here that
00:10:25
Speaker
Those are just products we picked based on our requirements. There are many other ones out there and I'm not endorsing anyone. That's what we're going to do. That said, bringing those tools to our analysts is helping them uncover some things.
00:10:42
Speaker
I would joke that some of our national transportation statistics are compiled by bringing spreadsheets from other government agencies into our agency. And those spreadsheets have errors in them too. And so that's more of an artisanal dataset, as I like to call it. That's nice, yeah.
00:11:00
Speaker
Artisanal datasets are rife with problems. And these visualization tools are helping us discover some things that were longstanding issues that we're now able to address for the first time because we're just looking at our data differently and we're aggregating it differently. We're also storing it differently.
00:11:19
Speaker
A lot of what we release is sort of a presentation ready data set. It's more of an information kind of basis rather than sort of the way we store it here inside the agency. So my stakeholders are learning to sort of publish data for consumption and not for presentation.
00:11:38
Speaker
Gotcha. So that we can still ensure that we can give people some liberal ways to present that data, but still make sure that the numbers that are in there and the rule sets that we have around those numbers can still maintain some consistency. And that, I think, is really the big learning experience that we're going through. So when we started to bring things like Socrata and Tableau into the agency,
00:12:06
Speaker
People were like, I just can't see my data this way and I want to see it. And it's because they didn't have data. They had information. And so it's a little bit of unpivoting and thinking about different storage techniques and sort of
00:12:22
Speaker
thinking more from a programmer kind of perspective about the way we disseminate this data. Yeah, which is like a culture change for a lot of analysts. So it took a little while to get there. But we've
Balancing Transparency and Data Misrepresentation
00:12:34
Speaker
made really huge strides in the last three or four months. And having these tools is really driving that change. So it's super exciting to see sort of what will happen next.
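The "unpivoting" mentioned above can be sketched as follows. This assumes a presentation-style table with one column per year; the field names and values are hypothetical and not taken from any DOT data set.

```python
def unpivot(rows, id_field, var_name="year", value_name="value"):
    """Turn one-column-per-year rows into one-row-per-observation records."""
    long_rows = []
    for row in rows:
        for key, val in row.items():
            if key == id_field:
                continue  # the identifier is repeated on each output row instead
            long_rows.append({id_field: row[id_field], var_name: key, value_name: val})
    return long_rows


# Hypothetical "presentation-ready" table: easy to read, hard to aggregate.
wide = [
    {"state": "A", "2014": 10, "2015": 12},
    {"state": "B", "2014": 7, "2015": 9},
]

long_rows = unpivot(wide, "state", value_name="fatal_crashes")
for record in long_rows:
    print(record)
```

The long form is what visualization and API tools generally expect: each record carries its own state, year, and value, so filtering or regrouping does not depend on the column layout.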
00:12:45
Speaker
And I'm just, I'm very fired up about how we're going to be able to make our data available in new and creative ways for folks like the ones that are listening to this podcast. Right. When you open data up, do you worry at all about
00:13:01
Speaker
who is going to be grabbing it and using it and doing some analysis and writing some news report or some other report. But maybe they're not the experts the way the folks here are experts. So they may not fully understand. The rate versus level is a great example, right? Like we see maps that are like, yeah, that's just a population. I put some dots on a map. And yay, look at this thing. My favorite clickable article. So do you worry about that? And if so,
00:13:30
Speaker
Do you try to put safeguards on the documentation? So how do you try to manage that? I mean, yes and no. So I don't let fear, uncertainty, and doubt sort of keep us from releasing data. That's not going to fly here. One thing I will say is that as an agency, we've been releasing some of these data since the 70s. So we had to have a culture of transparency anyway.
00:14:01
Speaker
We do worry about misrepresentation and misuse and those kinds of things. But there's always been a human being behind every one of those data sets. And we've always listed an email and a phone. Well, not always an email, but contact information for humans. And people take us up on it. And so I've been on calls with reporters who are like, I'm interested in writing about transit. But before I go use the National Transit Database, which goes back to the
00:14:27
Speaker
80s, right? Can you walk me through some things? Yeah.
00:14:31
Speaker
Yeah, sure we can. What are you trying to do? Here's the tables that best suit that, and here's what you need to know about those tables. And people will ask us questions all the time, no matter how much documentation we put out there. So in a couple weeks, we're working with the National Science Foundation's Big Data Hubs to try to get more of the big data academic community fired up about the increase in roadway fatalities that we've been experiencing the last couple of years.
00:14:59
Speaker
It's up 7% in 2015 over the previous year, and it's up another 7% in 2016 over 2015. So we've lost a decade of progress in two years. And we need to understand why. And I'll come back to that in a second.
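As a quick arithmetic check on the two increases quoted above, here is a minimal sketch. It takes the rounded 7% figures from the conversation at face value and compounds them; the function name is made up for illustration.

```python
def compound_change(*annual_changes: float) -> float:
    """Combine successive fractional changes into one overall fractional change."""
    total = 1.0
    for change in annual_changes:
        total *= 1.0 + change
    return total - 1.0


# Two consecutive 7% increases, as quoted in the conversation (rounded figures).
two_year = compound_change(0.07, 0.07)
print(f"Two-year change: {two_year:.2%}")
```

Because the second increase applies to an already higher base, the two-year change comes out slightly above a simple 14%.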
00:15:19
Speaker
But the documentation for the census of fatal crashes, which is a very rich data set about people and vehicles and roadway conditions and all these other kinds of things, is 500 pages long in a PDF.
00:15:32
Speaker
because there's that many variables and that many code lists. And it's just, it's hard to get your arms around it. But then we have the analytical users manual. And by the way, there's always a new kind of car or vehicle, if you will, right? So like a small thing that becomes a big thing, electric powered bicycles, right? So you've seen these bicycles out with the motor assist. Is that a bicycle or
Understanding Crash Data Limitations
00:15:59
Speaker
a motorcycle? Oh, right. Interesting. I'm going to say no idea. No idea. But I've got to put a new code in the database for this thing. Right? Because it's a brand new thing that's like popular enough that I actually need to make a code for it. And if you approach the database and saw a bicycle,
00:16:20
Speaker
I know what image would pop in your head. So how do I say bicycle motorized? And how motorized is it? It's a 99, I think. It's usually a 99. The other things that are sort of challenging about some of that data set, it's very difficult to... People are very interested in drugged driving. And so things like marijuana laws that are happening around the country and potential impacts on roadway safety are a big deal.
00:16:50
Speaker
And different states have different testing rules and different toxicology parameters. And so you can't just read the value out of the database without understanding why there's missing data and what the protocols for testing in that state are.
00:17:07
Speaker
But if you were just trying to grab our data and look at it, you'd draw the wrong conclusion. So hopefully it would look weird enough that somebody would pick up the phone or fire off an email and ask us a question. So there's nothing wrong with exploratory data analysis. We do our best to explain some of these things, but we can't anticipate every question. No, no, right. You mentioned earlier that you don't have all the fatal crashes.
00:17:32
Speaker
Not all the crashes. You have all the fatal crashes. So I'm coming back to that. So you don't have all accidents. Correct. So this comes back to your point about policy versus place. So if I want to know what's going on in my small town in Nebraska, let's say, I may not know all the transportation
00:17:54
Speaker
issues. There's reporting thresholds for a lot of these things. So we actually don't have a straight up law that says every state needs to give us their file of fatal crashes. It's a cooperative agreement that we've been working for the last 40 plus years in the interest of public safety and public health. So fatal crashes were the scourge on our roadways. In the 60s it was 60,000 people a year. We got it down to 32,000 people a year
00:18:23
Speaker
with a lot of hard work. Some of that was just seat belts. But, you know, speed and alcohol are still the number one killers on the roads. So, you know, we've made incredible progress over that long period of time because it turns out social change takes a while. So, I think one of the things is that we didn't need that additional sort of non-fatal crash
00:18:52
Speaker
context, but we do a sample instead. So there's a nationwide sample and immediately it requires a level of sophistication to approach this data. The results from the sample are out there too. It's called the General Estimates System. And we're working through a modernization of that sample because it hadn't been changed in a while. And it's going to become sort of the Crash Report Sampling System, I believe, as its new name.
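To illustrate why a sample "requires a level of sophistication," here is a deliberately simplified sketch of turning sampled records into national estimates. It assumes each sampled crash carries a weight saying how many real crashes it stands for; the field names and numbers are hypothetical, and a real survey design would also need variance estimation.

```python
def estimate_total(sampled_records, weight_field="weight"):
    """Estimate a population count by summing the weights of sampled records."""
    return sum(record[weight_field] for record in sampled_records)


# Hypothetical sample: three sampled crashes, each representing many real ones.
sample = [
    {"crash_id": 1, "injury": True, "weight": 120.0},
    {"crash_id": 2, "injury": False, "weight": 80.0},
    {"crash_id": 3, "injury": True, "weight": 100.0},
]

estimated_all = estimate_total(sample)
estimated_injury = estimate_total([r for r in sample if r["injury"]])
print(estimated_all, estimated_injury)
```

Counting rows instead of summing weights would report three crashes, which is the kind of wrong conclusion a casual user of sampled data can reach.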
00:19:17
Speaker
At any rate, we'll get the full details on a smaller subset of crashes. Different states have different rules about what they're allowed to release as it relates to crash data. Because they've agreed to give us at least fatal crashes, that stuff flows. States have liability laws that might expose them to risk. And if you had crash data that was out there,
00:19:40
Speaker
It could be used in a court of law to say you didn't take action quickly enough to fix your roadway. And that's why I got injured on this road. Just outside DOT, we've got a pretty dangerous intersection at M Street and New Jersey that you walked around, I think. So even in the short time that I've been here in the department, there have been a couple of fatalities
00:20:06
Speaker
on that road. So there need to be some things that happen, but you have to prioritize resources and those kinds of things. Sometimes we use the privacy argument. Sometimes we use the liability argument.
00:20:20
Speaker
I think you're finding the policy driver that's pushing for transparency now is the Vision Zero efforts. So if people are following the local efforts to eliminate pedestrian and bicyclist fatal crashes in cities all around the country, all of those things are backed by data.
00:20:39
Speaker
Right. And community engagement as well. And so there's a transparency emerging in sort of revealing all of that crash data and traffic citations and red light camera violations and roadway conditions, bringing the full picture and actually talking about it with the community.
00:20:59
Speaker
about what the investments are, what we can do, how we can work together to make our streets safer, how we work together with law enforcement, who's there not only to prevent crime, but also to prevent traffic violations because traffic violations create risk. So how do we use our resources smarter and work together?
00:21:18
Speaker
is a conversation that is now backed by transparency and data. Here in DC, the Vision Zero website is one of the richest ones out there. Chicago has been doing some great stuff. So now we're able to get sort of full crash file data. But of course, every city doesn't release the same fields and doesn't call everything the same thing. The beauty of the census of fatal crashes is we work to normalize terms. So on a police accident report in Minnesota,
00:21:46
Speaker
They'll have all kinds of snow. Blowing snow, right? Blizzard snow, icy snow, yeah. Sounds familiar. We only care about snow. Right, right, right. In Florida, they probably don't have snow on their police accident reports. Yeah. It's reported as other.
00:22:02
Speaker
It's the alligator field, and then there's the other. Right. So it's sunny, sunny, sunny, or rainy. So everybody uses different terms and different granularity across the country, and that's good because it meets their local needs. It inhibits national comparability. So these new open data things can be aggregated into something cool.
00:22:25
Speaker
but would require a lot of work to make something more uniform for analytical purposes. Does that make sense? So I think it's exciting, but hard. It's the best way I could describe it. So that means that someone here has to pull all those data sets together.
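The normalization described above, where many local weather terms collapse into one national category, can be sketched like this. The mapping, the category names, and the function are all hypothetical and are not the actual code lists used in the census of fatal crashes.

```python
# Hypothetical mapping from local police-report terms to a national category.
NATIONAL_WEATHER = {
    "blowing snow": "snow",
    "blizzard snow": "snow",
    "icy snow": "snow",
    "snow": "snow",
    "rainy": "rain",
    "sunny": "clear",
}


def normalize_weather(local_term: str) -> str:
    """Map a local weather term to a national category, defaulting to 'other'."""
    return NATIONAL_WEATHER.get(local_term.strip().lower(), "other")


for term in ["Blowing Snow", "icy snow", "Sunny", "dust storm"]:
    print(term, "->", normalize_weather(term))
```

The trade-off is the one described in the conversation: local detail is lost on purpose so that records from different states can be compared.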
00:22:44
Speaker
So to give you a sense of the universe of crashes, I think the number for last year is going to be somewhere in the 38 to 39,000 fatal crashes area.
00:22:58
Speaker
The total number of crashes that result in some sort of injury is something like three million. Right. A lot. Yeah. Right. So you see the n gets really big really fast. Yeah. Then there's a universe of crashes that are not injury related but still reported to the police.
00:23:16
Speaker
And that's like six-ish million. And then there's a bunch of property damage only crashes that don't get reported to anybody but insurance, and sometimes not even them. And that's in sort of the 10 to 12 million area, I think. And it's sort of me trying to remember numbers from a period that I saw in a presentation.
00:23:39
Speaker
So we're at a place now where the n was big enough for us to make big policy decisions 40, 50 years ago. The n is small enough now where we need to bring new context to our data sets to make smarter policy decisions. And doing that in a way that is collaborative and
00:24:01
Speaker
sort of leverages open data is, I think, the way to do it in the 21st century. But finding our way to making that useful, I think, is still kind of a journey.
Tech Impact on DOT Data Practices
00:24:11
Speaker
Yeah. That make sense? Yeah. I want to ask one more question. A lot of the things we've just been talking about are things that are happening and looking forward into the future of working with data, open data, and new tools. I wonder if there are technological things that you see on the horizon that are going to make your life
00:24:31
Speaker
easier or harder, or the department's life easier or harder. I mean, the first thing that obviously pops to mind is driverless cars. Are there things like that that you are all thinking about or working hard on, to look to the horizon of what's coming next?
00:24:46
Speaker
Yeah, the data sets that I didn't talk about at all are our research data sets. And people who love research data or are trying to do secondary research on data, you can see what the future of transportation data looks like today. If you want to Google the Intelligent Transportation Systems Research Data Exchange, you'll see data sets from connected vehicle deployments that are happening around the country. So the near future of transportation involves
00:25:15
Speaker
vehicle to vehicle and vehicle to infrastructure communications. We've been working on the message sets for those for some time. And the data sets that are out there are very rich and also super nerdy, right? But we've been investing heavily in preserving the data from those deployments and sharing that data so that people can start to see what kinds of applications we can build.
00:25:41
Speaker
Adjunct to that research data exchange is the Open Source Application Development Portal. It has that full name. And you can find some of the applications that are being built.
00:25:53
Speaker
off of the data from those connected vehicle deployments and new concepts that are being tested on a regular basis. There are active deployments that are going to start sending data to us in the next six to nine months. We've got the connected vehicle pilot in Wyoming, which is about sharing weather data with truckers because there are a lot of dangerous mountain passes and conditions can change in the mountains
00:26:18
Speaker
frequently, which is dangerous for a giant big rig running down the road. So being able to share, like if a truck could talk to another truck to say the wind has picked up in this pass.
00:26:32
Speaker
The trucker behind can make some choices. We can get platooning and other kinds of efficiencies, a lot of these kinds of things. So we're excited about those. The other two deployments that are coming online are in Tampa and in New York City. Each one of them has a different application area.
00:26:49
Speaker
The other project that's obviously of interest is the Smart City Challenge winner in Columbus, Ohio. And I think that it'll probably still be several more months before they start generating data. But I think they're in a good place. And that was designed to be a laboratory environment where people could see the data that was being generated from the connected and automated vehicle deployments.
00:27:11
Speaker
and start to see what works and then replicate it. So the idea of this being open and available is a big deal. Automated vehicles are exciting, and the Secretary is reviewing the automated vehicles policy that came out in the last administration. But this is a harder spot.
00:27:33
Speaker
Because it's a very pre-competitive environment, right? There's a lot of experimentation happening right now, and people are trying to capitalize on a market. So data sharing is kind of stymied, right? And there are policy questions around sort of, what do I need? What do I actually need? And we don't know, because we don't know what all these things do.
00:27:58
Speaker
There are frameworks that can exist for data sharing that maybe involve sort of the creation of a third party that could hold the data in escrow and sort of manage access a little bit better. It's difficult because not only is it pre-competitive, but we're also the regulator for that industry and sharing data with your regulator. Yeah, kind of fun.
00:28:20
Speaker
Well, it's just not practice, right? Yeah. So, you know, the five-star, like I said, in a pre-competitive market, right, that changes the paradigm of the whole dynamic, right? Right. You know, I think we're still working on conversations about, we're not going to compete on safety, right? Like, safety is baked into these things that are coming online, right?
00:28:44
Speaker
And I'm not sure that we have uniform agreement on that at this stage of the development of the technology. So I think there's a lot of work to do yet. And it's just going to be a national conversation with this stuff. But what it means technologically for us is we're not going to be able to haul that data.
00:29:02
Speaker
all around. And we're not even going to want to keep all that data forever when it comes time.
Future of Scalable Data Systems
00:29:09
Speaker
So thinking ahead to what the data sharing paradigms look like and how to design a system that is open, replicable, and scalable that can cooperate more easily across just even county lines. And presumably in real time. That's all coming in real time. Now we're talking about, now we're not worried about annual data.
00:29:31
Speaker
huge amounts of data coming in real time. And it's not all going to come to the federal government. So how do states and cities up their game? Because are they really prepared for what's coming? The research questions that we're asking are, how do we design open source, scalable, and replicable architectures that can be picked up and deployed?
00:29:55
Speaker
And so we're in that study stage right now. I think it's actually, this is probably the most exciting time to be part of transportation since the invention of the automobile. I'm just going to put it out there.
00:30:10
Speaker
So I don't feel strongly about this. The data guy is excited about data. But what we're able to do and how we're able to get it done I think really matters.
Conclusion and Listener Engagement
00:30:22
Speaker
So I think there are going to be some sticky policy questions that are not going to get away from us. But being able to think...
00:30:30
Speaker
about what we might be able to get from data sharing and how that either fosters safety or consumer adoption. The purpose of data sharing in this space is maybe, one, for the regulatory purpose, but two, to build consumer confidence, right? Like,
00:30:48
Speaker
Some people trust the government still, right? You might not know that the five-star safety rating comes from NHTSA and the government, but that's okay. You know what five stars means? It means safe. And that's all that really matters. So what does that look like in a connected and automated vehicle world? I don't think we know quite yet. Interesting.
00:31:10
Speaker
Well, it sounds like you have your work cut out for you. So at least you're in a nice building in Southeast and near the stadium too. So it's all good. Dan, thanks so much for coming on the show. This has been interesting and a lot of data for people to go play around with. Absolutely. And if people can't find it, you can email me.
00:31:26
Speaker
Yeah, you can get in touch with Dan and I'll be sure to put as many links as I can in the show notes and of course to the main data portals at DOT and some of the other departments. So thanks everyone for tuning into this week's episode. If you have questions or comments, please do get in touch on the website or on Twitter or anywhere else. And please do rate and review the show on iTunes. And that's all we have for this week. So until next time, this has been the PolicyViz podcast. Thanks so much for listening.