Podcast Sponsorship
00:00:00
Speaker
This episode of the PolicyViz podcast is brought to you by JMP, Statistical Discovery Software from SAS. JMP, spelled J-M-P, is an easy to use tool that connects powerful analytics with interactive graphics. The drag and drop interface of JMP enables quick exploration of data to identify patterns, interactions, and outliers.
00:00:19
Speaker
JMP has a scripting language for reproducibility and interfacing with R. Click on this episode's sponsored link to receive a free info kit that includes an interview with DataVis experts Kaiser Fung and Alberto Cairo. In the interview, they discuss information gathering, analysis, and communicating results.
Introduction to GovTech Singapore
00:00:50
Speaker
Welcome to the PolicyViz podcast. I'm your host, Jon Schwabish. I'm very excited for this week's episode because we're going around the world. On this week's
Role of GovTech in Smart Nation Technology
00:00:58
Speaker
episode, I'm very excited to have Feng-Yuan Liu and Shangqian Lee from GovTech in Singapore. GovTech is the Government Technology Agency in Singapore.
00:01:09
Speaker
They're responsible for implementing the Smart Nation, or smart city, technology. They're doing a lot of exciting work there in open data and data visualization and using data to tackle some important policy issues. I first found out about their work actually through a former student of mine, Anthea Pyeong,
00:01:28
Speaker
who sent me a really fascinating blog post about a project that GovTech had done on, let's call it a rogue train, and what was causing some issues with a rogue train in the Singapore subway system and how GovTech dove into the data and uncovered the mystery. So we're going to get to
Meet the Guests from GovTech
00:01:47
Speaker
that in a moment, but before we do so, I'm very excited to introduce to everybody Feng-Yuan Liu and Shangqian Lee from GovTech. Feng-Yuan, Shangqian, thanks so much for coming on the show.
00:01:56
Speaker
Thanks so much, Jon. Yeah, it's very exciting to have you. What I'd like to do is start by maybe having each of you sort of introduce yourselves, talk about your background and the work that you do at GovTech, and then we can dive into the work that you do there and the goals that GovTech has in Singapore. So, Feng-Yuan, can we start with you? What's your background? How did you get involved with GovTech? And what do you do there? Thanks so much, Jon, for having us. I'm Feng-Yuan,
00:02:26
Speaker
from GovTech, I run essentially a multidisciplinary data science team here at GovTech. My motivation was really having worked in government in various finance, transportation and energy policy roles. I was really motivated by harnessing data and evidence to make policy decisions and operational decisions. So essentially it's kind of about data for the public good.
00:02:52
Speaker
My previous job was in public transportation and we were trying to use data to better target where we added public transport capacity. With a lot more interest in data science and technology evolving, what I really wanted to do was to set up a team that had a range of different skill sets and personalities. You need economists, statisticians, designers, data visualizers,
00:03:20
Speaker
kind of computer scientists, guys who understood machine learning and other advanced forms of computational methods, and to try and bring them together to try and solve problems for government. So that's a bit about my background and why I'm doing what I'm doing. That's great. And Shangqian, I know you've worked on this rogue train project, and we'll get to that in a little bit, but could you talk about also how you came to be working at GovTech?
00:03:45
Speaker
Hi, Jon. Once again, thank you for having us on the show. I'm a data scientist with the Government Technology Agency, fairly fresh out of university. So yes, this is my first job. And much of my recent work has really been related to developing privacy models for the digital products the government builds. And really, my interest is how we can minimize the privacy risk to individuals, even as we go about putting all that government data to work for the public good.
00:04:14
Speaker
Right, right. Well, I know as more and more data become open, this privacy issue is more and more important. So thank you. Can we start by sort of getting a background on GovTech, what it is, where it came from, the sort of projects that you're working on, and the goals that you have, both sort of as your team and overall?
Enhancing Citizen's Lives with Data and Technology
00:04:34
Speaker
Well, sure. So the Government Technology Agency is a fairly new setup, a new organization. It inherited what was a traditional government IT arm. But in the last two or three years, we've also included and kind of created almost from scratch a government digital services team.
00:04:53
Speaker
That's pretty much similar to 18F in the US, and my data science team sits within that government digital services team. We're really trying to drive our Smart Nation objectives, which is really trying to use data and digital technology to improve the lives of citizens in a whole range of areas, from healthcare, to transportation,
00:05:16
Speaker
to just basic convenience when you're dealing with a government transaction, trying to move more of this kind of online and to improve the user experience. So that's kind of the larger mandate of GovTech. It's a pretty exciting time. It involves kind of building up technical teams within government when traditionally we've done most of our work kind of through outsourcing to contractors.
00:05:43
Speaker
And it's about bringing completely new skill sets into government and getting people who've never thought about working in government, people who've spent most of their careers out in the private sector,
00:05:53
Speaker
but who've realized that they do want to do something for the public good. So that's pretty much what we do. Yeah. Now prior to GovTech, prior to two and a half or three years ago, what was the government's open data policy or platform? What was there? I mean, I know everything has evolved. Everything's evolved here in the US, but what is the evolution? What did things look like prior to GovTech, prior to you guys coming in and developing this team?
Evolution of Singapore's Open Data Policy
00:06:21
Speaker
So our open data site is at data.gov.sg. It's been around since 2011 and for a long time we've put out quite a large amount of data, both static data as well as kind of APIs from different government agencies ranging from transportation to weather to other services that the government has. But traditionally it's been presented almost as a kind of database dump or database repository.
00:06:51
Speaker
And what we tried to do in the last two years is to focus a lot more on communicating the data, getting the data out to regular citizens. So you'll see on the site a big focus on data visualizations and charts so that instantaneously people can kind of get into the data, can explore the data and get an appreciation for it, because
00:07:11
Speaker
when you give people a link and they need to download it, and it comes up as a CSV or an Excel file, it's a pretty high barrier for most people to kind of get in and explore that. And we wanted to lower that barrier and improve that experience. The second thing we wanted to do was this: you can do stuff with visualization, but it's still presenting facts. It was hard to kind of piece together
00:07:33
Speaker
a whole narrative, a whole story around the data. And we were pretty inspired by what we saw going on in different journalism outfits and blogs, mostly coming out of the US. And we said, hey, this is an opportunity for government agencies to get their message out. Of course, they want to be transparent with the data and release the underlying data. But what's the background, what's the context, and what's the story behind all this data?
00:08:00
Speaker
And that's what we set out to try and achieve.
00:08:04
Speaker
Yeah, it's really interesting, the parallels between what's going on there and what's going on here in the US, similar sorts of challenges of communicating data. I wonder if you could talk a little bit about the community there, both inside and outside of government. I know DataKind recently opened a new chapter in Singapore, one of, I think, three. They opened one here in DC, one in San Francisco, one there.
00:08:31
Speaker
Are there meetup groups? Is there an active community? Now that you're able to communicate data, perhaps in a better way, what does the community look like when you go out and talk to people? There's a whole range of different meetup groups and there's tons of interest in data science in Singapore. I think the largest group is one called DataScience SG, with about 5,000 members. Yes, you're right. There's a DataKind SG chapter, and that is about 2,000 people.
00:08:59
Speaker
There are a range of other meetup groups. I think they meet pretty regularly. I was just at two DataScience SG events in the last months. It's a really interesting and diverse community of various types of professionals and people working in data or interested in data. We've got people from banks, from telcos, from healthcare, from research communities. A lot of them are really practitioners
00:09:28
Speaker
and trying to learn more about what everybody else is doing, what the latest technologies are, and how they can get more out of the data. So it's a pretty vibrant community. I would say not that many data visualization specific meetup groups, but in a lot of the data science and big data meetups, there's always a focus on data visualization.
00:09:51
Speaker
One thing we did recently was to try and get together a community of journalists. So we had a little half-day workshop for journalists, some of whom were from more traditional print media, as well as people from their kind of digital desks. And we just sat them together in a really cool brainstorming and sharing session about how to tell good data stories and what technology tools are out there, both for advanced users and for somebody who's never written a line of code before.
00:10:21
Speaker
And it's really bringing together the different worlds, from people who traditionally have been English lit majors writing stories and trying to teach them how to code, to bringing in some of these programmers and getting them to think about how they want to tell their stories and visualize them. So it's a really eclectic mix out here in Singapore.
00:10:43
Speaker
Do you get the sense that the fact that you've been able to open more data and provide it in sort of easier ways has opened the floodgates for people to become more engaged with the data and to explore Singapore in sort of a data-driven way that they weren't really able to do so before? We hope so. But I think it's really a process of encouraging agencies and corporates to be a lot more transparent about the information they put out. I think Singapore is at the stage
00:11:13
Speaker
We're a fairly young country, but I think we're at a stage where citizens demand a lot more from their governments. Not just in terms of great outcomes and efficiency and everything being well run, but in terms of the process: how did the government go about doing this, and being very transparent about the facts and the approach. We pride ourselves on being very evidence-based and very efficient in delivering services. I think open data is just one part of that.
00:11:38
Speaker
We've seen a lot of interest. I mean, just the analytics on our blog and our site showed a kind of 1024 increase in kind of viewership from the US and from Europe. So traditionally data.gov has been kind of more kind of a domestic audience and we've seen a lot more people interested in this. And lots of people have kind of reached out to us. So hopefully it drives a larger community and a greater interest in this. But it also requires people to
00:12:06
Speaker
to be curious and to learn new skills and take new approaches that they might not be naturally inclined to. So getting the designers, the journalists, and the developers to talk and to engage in that creative process. Yeah, I think that's one of the big challenges. Getting all these different groups to talk to one another is a huge challenge. So it's
Rogue Train Project Using Data Visualization
00:12:29
Speaker
in some ways comforting to hear that that challenge is a problem or a challenge everywhere in every country, every culture.
00:12:35
Speaker
So I want to talk a little bit about this Rogue Train project. Shangqian, I get a sense that you worked on this project and there's a great blog post, but I want to give you a chance to talk about the challenge, talk about the project and how you and your teams approached it. And I think it's a really interesting case of data solving a problem. So can you tell us a little bit about this Circle Line Rogue Train problem?
00:13:01
Speaker
Well, thank you, Jon. It's really great that you have us on the show, actually, because this rogue train case was really a great example of how the right visualizations can make the difference between getting the insight you want or getting lost in your data, especially if your data has many dimensions. And for every non-trivial problem, you would expect that your data has many, many dimensions. Well, before I go more into detail on the case, for the benefit of your audience less familiar with Singapore, this rogue train
00:13:32
Speaker
incident was something that happened in October, November last year, where we experienced a series of continual and persistent breakdowns on Singapore's metro system, specifically the Circle Line of the metro system. Trains were frequently experiencing disruptions to a wireless communication link they had with the central control system. And what happens when this wireless link is disrupted?
00:13:56
Speaker
Well, the train automatically deploys an emergency brake as a safety measure. And as you would expect, this continual emergency braking led to some quite serious delays in the train system. And this happened over a period of five days or something like that. What made this situation a little different from other breakdowns? You know, Singapore's rail network is not perfect. And it does break down from time to time. But what made this situation
00:14:20
Speaker
quite different from other incidents was that rail engineers were unable to diagnose the cause of the wireless communication disruptions, and that's why it persisted over a particularly long period.
00:14:32
Speaker
We actually made quite a late start on the problem. Before we got involved, rail engineers had already spent a few days trying to diagnose the situation. They explored many relationships, like the frequency of breakdown of the trains with respect to time, with respect to location, with respect to train number. But they didn't manage to find any meaningful connections between the breakdowns and all those factors.
00:14:56
Speaker
What really held these initial investigations back was the way they approached their data visualizations. Most of the relationships they attempted to explore were presented on two-dimensional plots. For example, the frequency of breakdowns versus time.
00:15:11
Speaker
Or the frequency of breakdowns versus the location of the train. And this really made it difficult to spot trends that span across multiple factors, like the secondary relationship between the location and the time of the breakdowns. So how we did things differently was to attempt to put as much data as possible into a single plot. We started
00:15:31
Speaker
with a fairly small data set, actually, although people thought that we were dealing with pretty large data sets. We started with a CSV with slightly more than 200 lines and maybe five or six columns. This was the log of disruption events by train, location and time. And what we did was we plotted these incidents on a plot that included location and time on the axes and
00:15:57
Speaker
color-coded all the points according to the train affected. And this really allowed us to notice a very interesting pattern where the disruptions seemed to be occurring in sequence along the tracks, moving from station to station. Well, if you give it some thought, there are very few things that move along train tracks. And one of those few things that move along train tracks is trains, you know. So this suggested that a moving train was the cause of this disruption.
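[Editor's illustration, not part of the spoken interview.] The plot described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the column names (`time_min`, `station_idx`, `train`) and the five rows are invented for illustration, since the episode only describes the real file as a CSV of a little over 200 lines with five or six columns. Time goes on the horizontal axis, station on the vertical axis, and the marker is the ID of the affected train, standing in for the color coding.

```python
import csv
import io

# Hypothetical log: column names and rows are invented for illustration only.
LOG = """time_min,station_idx,train
0,0,A
4,1,B
8,2,A
12,3,C
16,4,B
"""

def load_events(text):
    """Parse the log into (time_min, station_idx, train) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(int(r["time_min"]), int(r["station_idx"]), r["train"]) for r in reader]

def text_scatter(events, bucket_min=4):
    """Text scatter: one row per station, one column per time bucket,
    with the affected train's ID as the marker."""
    n_rows = max(s for _, s, _ in events) + 1
    n_cols = max(t for t, _, _ in events) // bucket_min + 1
    grid = [["." for _ in range(n_cols)] for _ in range(n_rows)]
    for t, s, train in events:
        grid[s][t // bucket_min] = train
    return ["".join(row) for row in grid]

events = load_events(LOG)
rows = text_scatter(events)
for row in rows:
    print(row)
```

With this made-up data the markers fall on a diagonal: incidents hit different trains but advance one station per time step, which is the kind of joint location-and-time pattern that separate one-factor plots would hide.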
00:16:27
Speaker
With this thought in mind, to confirm our suspicions, we calculated the speed of this hypothetical moving disruption source and, to no one's surprise really, this moving disruption source had a speed which matched the speed of a train. With this information on hand, we looked up the operational records of all the trains and we found a single train whose location and time corresponded to virtually all of the incidents.
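[Editor's illustration, not part of the spoken interview.] The two checks described here, estimating the speed of the moving disruption source and then matching it against train records, might look roughly like the sketch below. Everything in it is an assumption for illustration: the incident coordinates, the record layout, the train IDs `T1` and `T2`, and the 0.5 km tolerance are hypothetical, and a least-squares slope is just one simple way to estimate a speed; the episode does not say how the team computed it.

```python
# Hypothetical incidents: (minutes since start, distance along the track in km).
incidents = [(0, 0.0), (2, 1.3), (4, 2.5), (6, 3.9), (8, 5.1)]

def fit_speed_kmh(points):
    """Least-squares slope of distance vs. time, converted from km/min to km/h."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_d = sum(d for _, d in points) / n
    cov = sum((t - mean_t) * (d - mean_d) for t, d in points)
    var = sum((t - mean_t) ** 2 for t, _ in points)
    return cov / var * 60

# Hypothetical operational records: train ID -> {minute: position in km}.
records = {
    "T1": {0: 9.0, 2: 7.6, 4: 6.4, 6: 5.0, 8: 3.8},
    "T2": {0: 0.2, 2: 1.4, 4: 2.6, 6: 3.8, 8: 5.2},
}

def nearby_count(track, points, tol_km=0.5):
    """Count incidents that occurred within tol_km of the train's logged position."""
    return sum(1 for t, d in points if t in track and abs(track[t] - d) <= tol_km)

speed = fit_speed_kmh(incidents)
suspect = max(records, key=lambda tid: nearby_count(records[tid], incidents))
print(round(speed, 1), suspect)
```

On this invented data the fitted speed is about 38.4 km/h and the train whose logged positions sit next to every incident is `T2`. Note that the suspect is the train that was nearby each time, not necessarily the train that suffered the disruption.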
00:16:56
Speaker
So on that same day, we went and made our case to the Ministry of Transport and the train operator, and they made a decision to send that faulty train on the tracks for a test run. And during that test run, well, wherever that particular train went, it caused all these disruptions as it moved up and down the line. So with all this evidence, both from the data's perspective and that live test run that day, the Ministry then was able to make that decision to take the train out of service. And after that point, all the disruptions stopped.
00:17:26
Speaker
Yeah, I mean, it's an amazing story of using data to uncover this sort of challenge. I'm curious, did your team and the engineering team ever sit down and talk or did the engineering team sort of do their analysis and sort of come up a little bit empty handed and then say, okay, why don't you give it a shot?
00:17:47
Speaker
I would say we worked fairly closely, actually. We came to the problem knowing nothing about trains and how the train system worked. While we could spot the patterns, we needed them to close the loop on explaining why this particular train would cause the disruption it caused.
00:18:05
Speaker
Even though we could say that yes, this train is highly correlated with disruption incidents, we needed rail engineers to really get down there and explain the data and really narrow it down to the hardware or software that led to this disruption. So I would say it was a fairly close collaboration between the data team and the rail engineering team.
00:18:23
Speaker
Yeah, it's fascinating because it's such a concrete example of having a direct effect on policy, a direct effect on people's lives. What was the feeling that you had when you sort of came to this realization that, oh, we're really honing in on this one thing, we think we figured it out, and then to have that sort of verified in an actual sort of engineering on-the-ground test? What was that feeling like? I'd like to contrast that feeling, which I'll describe shortly, with the initial
00:18:52
Speaker
thoughts I had about approaching this. When I first started this, I got a text message early in the morning, at seven, saying that I should narrow down the cause of these train breakdowns with a 200-line CSV. Without thinking, I was wondering
Impact of Data Science on Real-World Projects
00:19:09
Speaker
if this was a joke. At first, when they sent the CSV over, I told them,
00:19:16
Speaker
Oh, thanks for the data sample. Can I have the full data set? That was the full data set.
00:19:24
Speaker
So we really started out with nothing in mind, applying what we thought were fairly sensible visualization techniques, and later we put in some clustering techniques as well to confirm our suspicions. We did manage to zoom in on the data and zoom in on the cause, the potential cause of the breakdowns. And that was a fairly exhilarating experience. As a data scientist, a lot of the work I tend to do is fairly upstream.
00:19:51
Speaker
I'm somewhat removed from the downstream effects of this. And this is really the first time I felt that data analysis, data science, was having a real effect on the real world. It also matters that I take the Circle Line myself to work every day. So I had a direct stake in this. So Jon, just to add a bit more context.
00:20:13
Speaker
I think it was the Wednesday or Thursday when you were stuck in the Circle Line for about two hours and a whole bunch of my team were basically late coming in because they took the Circle Line. Basically, I was in Japan for a conference and I got a call early Saturday morning, late hours of Friday night saying, could you get some of your guys to help? Because it was a couple of people I knew from the ministry who were stuck.
00:20:42
Speaker
And I was initially pretty skeptical as well, but I said, okay, I'll look up some volunteers. At 7 a.m. on a Saturday morning, I just pinged our Slack group to say, anybody want to volunteer to help, you know, save the Circle Line? And yeah, surprisingly, there were a bunch of people who just put up their hands immediately. On a Saturday, right? Yeah, exactly.
00:21:09
Speaker
So the timelines were pretty short. So they got into the office, I think, about 10 o'clock. And by 4 or 5 o'clock, they had to make a presentation to this big committee of engineers and officials. And they had to kind of make the case. So it was pretty much four or five hours on a Saturday afternoon that they had.
00:21:35
Speaker
One point I should make is that, as Shangqian said, we came in pretty late, but we also had advocates within this big committee that was trying to solve this particular problem. In particular, there was somebody from the ministry who was convinced that we should take a look at the data more closely, and he was willing to kind of back the technical team up. So he was our advocate, he was going to the committees, making the presentations,
00:22:02
Speaker
you know, helping the team kind of shape how they should present this to the policymakers. And that's an important point because you do need an advocate from somebody who's within the team. When
Visual Data Representation in Decision-Making
00:22:13
Speaker
you're presenting new data or a new hypothesis, sometimes, you know, the initial reaction can be one that's a bit more kind of reluctant. So that advocate within the team is absolutely crucial.
00:22:26
Speaker
Yeah, I think that's a great point because I was about to ask you. I mean, the engineering teams presumably have been around for a long period of time, you know, building train lines and managing train lines, and GovTech, you've been around for two and a half or three years. It's sort of the new kid on the block. It's new technologies, a new way of
00:22:43
Speaker
approaching these issues. And I wonder what are the strategies that you found that resulted in success? And you've mentioned advocates, you've mentioned sort of this enthusiasm. Are there other things that you saw or got a sense of that resulted in success on this or other projects? I think one point to make is that the great thing about working with other types of engineers is that ultimately they're still pretty much fact-based and evidence-driven.
00:23:12
Speaker
The train networks are incredibly complex, by the way, and we only looked at a small data set of the outcomes. There were huge amounts of sensor information within the train network that we might have had to go to if we couldn't find a cause just from these 200 rows of data. But the system is incredibly complicated. There are many moving parts to this.
00:23:31
Speaker
And I suspect it was just that we provided a kind of fresh lens. And I think they were maybe a bit too much stuck in the weeds and it was almost kind of too much information flowing at them. And I think at some points they were led down a kind of line of inquiry that kind of was a distraction. And we were in some ways fortunate to have kind of spotted something and then kind of having to bring them back on track. The second thing is really data visualization. Because when you've got a committee of 30 or 40 people,
00:24:00
Speaker
having an image, a visualization like the ones you saw on our blog, I mean, there was a different form of that, that just kind of tells the story and paints the picture and makes people kind of sit up and say, you know what, there might be something here. And to do it in a kind of visual way really grabs the attention of a large group of people very quickly, even the skeptical ones. Right. And that's the power of data visualization. If we had presented it in kind of a table of numbers,
00:24:27
Speaker
it might not have had that effect. So a picture is very powerful in this context. And we found that in a lot of other work as well, that you've got a number of different visualizations, but there's one chart, one key chart, the money chart, right, that either sells the case or gives you that final insight. And it's how do you construct that chart, as Shangqian said, with as many dimensions as possible, as dense as possible, as rich as possible, but yet clear.
00:24:55
Speaker
And that's the art of maintaining that balance between as much information as possible and still having a clear narrative or a clear message that you want to get across just from that one chart.
00:25:07
Speaker
Absolutely. Absolutely. I don't think I
Closing Remarks and Future Discussions
00:25:09
Speaker
could have put it better myself. I would love to dive into more of this, and so what I want to do is invite you to come back on the show. Maybe later this spring or summer, we'll talk about some of your other projects, because I have this great sense that you're doing some great work. It seems to be a very exciting time there. So, Feng-Yuan, Shangqian, I want to thank you so much for coming on the show. It's been really fantastic to have you on. Thank you, Jon. Thank you, Jon. Thanks so much. And thanks to everyone for tuning into this week's episode.
00:25:36
Speaker
Next week, we'll have some more fantastic guests to talk about open data, data visualization, and some more research issues as they apply to data and open data and public policy. So until next time, this has been the PolicyViz podcast. Thanks so much for listening.