
Episode #55: William Cleveland

The PolicyViz Podcast

Welcome back to The PolicyViz Podcast. Number 55 if you’re keeping count. This is the final episode for the summer; I’m going to take some time off from recording, editing, and setting up before coming back in the fall with...


Transcript

Introduction and Program Promotion

00:00:00
Speaker
This episode of the PolicyViz podcast is brought to you by the Maryland Institute College of Art. MICA's professional graduate program in information visualization trains designers and analysts to translate data into compelling visual narratives. Benefit from the resources of a premier College of Art and Design while learning online.
00:00:20
Speaker
Earn your information visualization degree in just 15 months. Expert faculty includes Andy Kirk, Jon Schwabish, Marissa Peacock, and Rob Rolleston. Learn more at mica.edu slash MPSInvis.

Podcast Introduction and Guest Welcome

00:00:46
Speaker
Welcome back to the Policy Viz Podcast. I'm your host, Jon Schwabish. If you're keeping track, this is episode number 55. This will be the final episode for the summer, and I'll be taking a nice break and coming back with more great guests and more exciting shows in the fall. But to wrap up the summer, I'm very excited to have with me today on the show, William Cleveland, who is the Shanta S. Gupta Distinguished Professor of Statistics and courtesy professor of computer science at Purdue University.
00:01:15
Speaker
Professor Cleveland, welcome to the show. I'm pleased to be here, John. Thanks for inviting me. And what you're doing is not only very informative, but lots of fun. Oh, great. Well, I'm glad

William Cleveland's Achievements in Data Science

00:01:25
Speaker
you could come on. This is really exciting. Let me quickly, for those who aren't familiar with you and your work, let me quickly run down.
00:01:32
Speaker
Professor Cleveland has been at Purdue University since about 2004. Prior to Purdue, he was at Bell Labs in New Jersey. He's won numerous awards, numerous recognitions, has published hundreds of papers on visual perception, visual methods, deep computing, big data, data science, the list goes on and on. The most recent award, of course, being the American Statistical Association Computing and Graphics Award that was passed out this past January.
00:02:00
Speaker
Now, importantly, Professor Cleveland is also credited with defining the term data science, and that's where I think we want to start today.

What is Data Science?

00:02:07
Speaker
So, what would you say is data science now? Okay, yeah, that's the most important thing of all. What is it? What are we talking about here? And this gets a lot of discussion, and I find that much of it seems to be that there's just too many words. It's a list.
00:02:26
Speaker
This list can be summarized by saying it's everything that comes into play and can reasonably be affected by people who are methodologists for analyzing data. So it's all about data. What are all the things that we have to be concerned with when we analyze data? What would the list be? Well, certainly statistical methods, machine learning methods, statistical models,
00:02:52
Speaker
Statistical theory, theory is extremely important. If you can come up with a good theory, that can have a huge impact on the way people analyze their data. It includes also, and this is key, it includes doing data analysis. This list of things that are important for data analysis and developing methods
00:03:15
Speaker
and systems includes data analysis because if you're actually working in data analysis and coming up with things for data analysis, it's a very good idea for you, well let's say an organization, to have people that are actually analyzing data. But then the list continues on into computing.
00:03:36
Speaker
So there's a number of aspects of the

Computational Systems in Big Data

00:03:38
Speaker
computing. One certainly is the implementation of statistical and machine learning methods. You have to write code to implement those things. A second major area is computational systems.
00:03:53
Speaker
for data analysis. So what do I mean by a system? I mean something that starts up front with a data analyst putting fingers to keyboard and typing commands to carry out analysis. Let me just say today what this also means and what is terribly important and that is that at the back end you have a parallel distributed
00:04:19
Speaker
computational environment, and I suppose the best known these days is Hadoop, although another one called Spark is providing substantial competition for Hadoop, but it's a system that enables doing parallel computing.
00:04:34
Speaker
with the data. So, and obviously this is targeting being able to carry out analyses of the data when the data are big. So there's a lot of different kinds of approaches for exploiting these systems, but the actual putting together or using this system, I'm not suggesting that data science includes building something like Hadoop. For these system things, we grab things off the shelf and put them together to form a computational environment.
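The map-and-reduce pattern behind back ends like Hadoop and Spark can be illustrated with a toy sketch. This is purely illustrative, in plain Python rather than an actual distributed framework; the partitions and word data are hypothetical stand-ins for blocks of a large data set spread across cluster nodes:

```python
from collections import defaultdict

# Hypothetical stand-in for a distributed back end: the data live in
# partitions (on Hadoop these would be HDFS blocks on different nodes).
partitions = [
    ["error", "ok", "error"],
    ["ok", "ok", "warn"],
    ["warn", "error"],
]

def map_phase(partition):
    # Each node emits (key, 1) pairs for its own partition.
    return [(word, 1) for word in partition]

def shuffle(mapped):
    # The framework groups values by key across all partitions.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Each reducer collapses one key's values to a single result.
    return {key: sum(values) for key, values in groups.items()}

mapped = [map_phase(p) for p in partitions]   # runs in parallel on a cluster
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'error': 3, 'ok': 3, 'warn': 2}
```

In Hadoop proper the map calls run on the nodes holding each block and the shuffle moves data over the network, but the software an analyst writes to exploit such a system has this same shape.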
00:05:02
Speaker
But being able to write the software that exploits this to do data analysis is an important part of the whole computational system for the purpose of data analysis. So that pretty much covers what is I think a pretty good definition and the extent to which it's getting done varies tremendously.
00:05:24
Speaker
Depending on what sector of the scientific and engineering fields you're talking about. Now, it's a pretty long list. It covers sort of a whole life cycle, if you will, of working with data, analyzing it, computing with it, visualizing it. And I'm curious,
00:05:42
Speaker
When you came up with this term in 1999, 2001, how did you come up

Origin of 'Data Science' Term

00:05:51
Speaker
with the term? How did you define that term initially, sort of giving birth to this phrase, data science, which now is kind of the big buzzword? Yeah. By the way, I'm sorry. I made a terrible mistake in that list. I forgot to say visualization.
00:06:06
Speaker
All right, we got that, so we can check that box. All right, yeah. So the theme of the paper was saying basically, we've got to add the computing part of this. And I thought, well, let's call it something else. And how I came up with data science, I really couldn't tell you. Just came to you. That just happened. I was not aware, even at that time, I was unaware of other uses. But the other uses actually meant rather different things. The term was not new.
00:06:33
Speaker
But later on I found out, you know, well as Wikipedia started to pick these things up, that it had been used earlier but for rather different definitions. What I just said it is, while I wasn't as detailed as this originally, this is basically what the definition means, at least as I put it up there. And I think it's pretty much what most people mean these days.
00:06:55
Speaker
Now at the time you were at Bell Labs and you had been at Bell Labs for a while. So what were the sort of things that you were working on that gave birth to this idea of adding computing to form this term data science? Yes. Well, it really began when I was a graduate student at Yale.
00:07:11
Speaker
I was programming in Fortran. Actually, I'd been doing that as an undergraduate at Princeton, using Fortran, but there I was at Yale. I was computing with card decks. You'd type out the card deck and write the Fortran code, which was a very tedious process, by the way, and then that card deck would go into a computer center and you'd wait
00:07:35
Speaker
anywhere from four to even eight hours for the output to come back from a teletypewriter.
00:07:46
Speaker
And if you had a comma that wasn't supposed to be there, then the program failed and you had to go through the process again to correct the mistakes. So the use of Fortran for data analysis and the long wait time was a major factor, although it's the way it was. So, you know, you just said, well, that's life. But then suddenly there appeared in the math building
00:08:09
Speaker
a connection that went to the Yorktown Heights location of IBM where they had created a new interactive system. And it had a language called APL. And as a matter of fact, this had associated with it a statistics package that had been put together by a guy named, I think it was Smythe. Anyway, somebody, he was from Canada. So I heard about this and I thought, well, this is interesting. I went and started using it and I was stunned at, you know, if you made a mistake, it came back immediately.
00:08:37
Speaker
You know, you'd get an interactive message immediately saying, this is wrong, you know, in seconds. And so, and you could do it line by line and, you know, do things very interactively. And actually, it was an operator language. You had operators that you use to do the computation.
00:08:57
Speaker
Writing the commands for the data analysis was much, much easier than in Fortran. And as I said, getting the results back was very fast. I remember going to Jimmy Savage and saying, the people who have the biggest impact in statistics in the future are those that know how to compute.
00:09:13
Speaker
So he didn't quite accept that, and his reply took the form of, well, Bill, you know how to compute. Well, of course, I mean, I didn't really think of myself as being able to compute. I mean, I do some Fortran and I do some APL, but, you know, I think it was just his way of saying, well, maybe it's not quite that important. I don't know.
00:09:34
Speaker
It's hard to know. Savage was a genius actually. For those that might not know, he's the one that created the Bayesian revolution starting in about the 50s. So right then and there I thought it was important. And then eventually I ended up at Bell Labs where they had been doing this data science, as I described it, from very early on. So when I wrote that paper in 2001 and I said data science, I was using that term for the first time for that list I gave you.
00:10:03
Speaker
I don't think of coming up with a name as being very remarkable. I mean, people say, wow, you had a lot of forward thinking there. I just don't feel that way. I said, it's just a term I came up with, and we were already doing it at Bell Labs, and when I got there, they were doing it. So that's the remarkable part, the fact that Bell Labs was doing it, as far as I'm concerned.
00:10:21
Speaker
Yeah. So let's turn a little bit to visualization because perhaps one of the papers that you've written that a lot of people in the data visualization field are familiar with is the paper you wrote with Robert McGill on graphical perception.

Trellis Displays in Data Visualization

00:10:35
Speaker
But you've been doing more recently over the last few years, you've been doing more work on visual methods. And of course, one of the tools or products that you created was a trellis display. So can you talk a little bit about the trellis display and how you devise that and how you've been implementing that over the last few years?
00:10:51
Speaker
Sure, yes, I can do that. Let me just step back a little bit though and let's go back to Yale again. Okay. The first exposure I had to ideas of data visualization came from Frank Anscombe, who was one of the pioneers. I mean, John Tukey, Frank Anscombe and Cuthbert Daniel were people that said, you know, we got all these models for data and we have a wealth of methodology that we can use to fit models and then to make inferences.
00:11:19
Speaker
But the problem is that there is, at this point in time, not much in the way of ideas, methods for checking to see if the model actually fits the data. And they said this is a big deficiency. And one of the ways these guys argued for it was to take other people's analyses and show that the whole thing had been botched because the model didn't fit the data.
00:11:39
Speaker
And both Cuthbert Daniel and Frank Anscombe were pretty ruthless about that. They didn't hesitate for one second to say, well, this thing is all wrong and ridiculous because they got the wrong model. So when I sat in his course, Anscombe sold it. I mean, it was very clear. So that was my first exposure. Well, John Tukey was a big factor at Bell Laboratories because he was
00:12:01
Speaker
He was at both Princeton and Bell Labs. So he certainly was responsible for Bell Labs getting involved in data visualization. So when I arrived, there were already people doing graphical methods, as they called them then.
00:12:16
Speaker
So I got involved in that and so started thinking about methodology and doing new things. And that became one of my major tracks of research. But then there reached the point where I said there's a lot of principles here. A lot of it is how you actually render the display, and the visual perception associated with it that enables you to make conclusions about what you see. It needed study.
00:12:43
Speaker
It needed more thought. Actually, John Tukey had written a little bit about it, but he didn't really ever say visual perception or anything like that. He had some techniques to enable you to better perceive things. They were more methodological. So, for example, judging slopes of lines, that was something that seemed to me to be terribly important. I could tell that from examples that we had.
00:13:04
Speaker
It seemed to me that there needed to be work in this and so I just got going. I got some colleagues at Bell Labs and when I first started it, I mean, it wasn't let's just say people would look at me and go, huh? What are you doing? Of course, I was surrounded by people in computing statistics so it was like, you're going to do visual perception, are you nuts?
00:13:25
Speaker
But we did have a famous visual perception guy, Bela Julesz, at Bell Laboratories. So I went and talked to him, you know, and well, he thought it was fine. He thought it was a good idea. But anyway, so we started doing it and were able to come up with results, you know, compelling results showing that some ways of encoding the data on a display actually work much better, remarkably better, than other ways to do it. So, you know, and it was perceptual. So
00:13:52
Speaker
We just started writing papers and then other people got involved and thought about it also.
00:14:00
Speaker
So that was a line of research and then you moved on to more methods. Is that right? I was doing methods at the same time. Okay, okay. I actually wrote a book with three other people. Actually, I've been involved in three books in data visualization, but the first one was a team of us at Bell Labs who wrote the book, but it was too encyclopedic.
00:14:23
Speaker
It did okay, you know, and it was very early on, you know, in retrospect, I hope, well, I'm going to be honest, I think it's too encyclopedic. It just didn't really tell the story. So, and subsequently, I wrote, I wrote two books after that, The Elements of Graphing Data and Visualizing Data. So, in there, I really did try to tell the story, you know, through lots of examples. Everything starts with an example.
00:14:48
Speaker
Yeah, and as we know, now storytelling is the big emphasis on data visualization. So you could see that coming through already. But on the methods then, when did the trellis display work start and where did that come from? Well, actually, it's Rick Becker and I put together something, it was an interactive tool that we called brushing.
00:15:13
Speaker
And what would happen is that, you know, you have two plots. One of them, y against x1 and the other, y against x2. And we had this interactive tool where you could form, say, a rectangle and condition on x1 and it would highlight the points, you know, on the y versus x1 plot, it would highlight
00:15:36
Speaker
points that were in a certain range of values of x1. And then you could see the results on the plot of y versus x2. So effectively what we were doing is we were conditioning. We were conditioning on x1 and seeing what the dependence was on x2. So this worked out, but as time went along we began to realize that, well,
00:15:58
Speaker
Boy, first of all, it doesn't extend to very many variables and it required sort of remembering what it was you saw. You know, this was a dynamic display. I mean, it would change through time and so you didn't get to see the previous stuff. So we said, you know what, we should actually
00:16:16
Speaker
start putting together a visualization system that does the conditioning and then makes the plot. But it comes out as, let's say, hard copy, you might say. So that's when we took this conditioning idea and made it be something where you specified. We had a language that we created for doing the specification. In so doing, you basically broke the data up into subsets because you condition on intervals of x1, x2, x3.
00:16:45
Speaker
And then we had a way of being able to lay it all out. So each subset of the data from doing the conditioning collectively had a, you know, you apply a visualization method to it. And then there would be panels. Panels were created, you know, and there were rows and columns and pages of panels. And so it was easier to absorb the information. You had, you know, you had the whole thing there in front of you.
00:17:10
Speaker
You might have to go across pages, and of course that was something of a limitation. The visual system works best if you're comparing two different plots. It's better if it's in your visual field. If you have to look at plot number one and then you look at plot number two separately, the visual system can't do as good a job as comparing the patterns on the two plots. So you try to do that, but that's not possible to do if you get a lot of subsets of the data.
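The conditioning Cleveland describes, splitting the data into subsets by intervals of a conditioning variable and giving each subset its own panel, can be sketched roughly as follows. This is a toy illustration, not the S/Trellis implementation; the interval rule loosely mimics Trellis's equal-count shingles, and boundary values may land in two adjacent panels:

```python
def equal_count_intervals(values, n_intervals):
    """Cut the range of a variable into intervals holding roughly equal
    numbers of observations (loosely like Trellis equal-count shingles)."""
    ordered = sorted(values)
    size = len(ordered) / n_intervals
    cuts = [ordered[int(i * size)] for i in range(n_intervals)]
    cuts.append(ordered[-1])
    return list(zip(cuts[:-1], cuts[1:]))

def condition(points, n_intervals):
    """points: list of (x1, x2, y) triples. Condition on x1: each x1
    interval becomes one panel holding the (x2, y) pairs falling in it."""
    intervals = equal_count_intervals([p[0] for p in points], n_intervals)
    panels = []
    for lo, hi in intervals:
        subset = [(x2, y) for x1, x2, y in points if lo <= x1 <= hi]
        panels.append(((lo, hi), subset))
    return panels

# Toy data: y depends on both x1 and x2.
points = [(x1, x2, x1 * x2) for x1 in range(6) for x2 in range(4)]
panels = condition(points, 2)
for (lo, hi), subset in panels:
    # Each panel would be drawn as one plot of y versus x2,
    # laid out in rows and columns for side-by-side comparison.
    print(f"panel for x1 in [{lo}, {hi}]: {len(subset)} points")
```

A real trellis system then renders each panel with the same scales so the visual system can compare patterns across panels within one visual field.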
00:17:37
Speaker
Anyway, it worked out quite well and was implemented in S and S Plus and went off as part of the commercial part of the S language. By the way, I can't tell you how marvelous it was to be in the place where the S language was developed. In fact, I can remember the first meeting. It was in the office next to me. I wasn't involved.
00:18:01
Speaker
Again, I felt very lucky to be at Bell Labs because of S. We were off and running before anybody else was able to do it.
00:18:10
Speaker
Yes, so this led to Trellis display. That first implementation of it, which was basically done by Rick Becker, and then later on Allan Wilks joined us, that was a very challenging computing effort, I want to tell you. So again, we're back to the computing being important. I mean, there would have been no real Trellis display without that computational system, which is complex.
00:18:36
Speaker
Then later on, Deepayan Sarkar came along and did lattice graphics, which is Trellis display with a different name, as there was worry about trademark stuff. And Deepayan did a beautiful job in his implementation, but it was easier because he had a much easier to program underlying graphics system. Right. Now, when you were developing it, were you thinking at all about big data? You're doing a lot of big data work now, and you've obviously mentioned computational performance, but were you thinking
00:19:05
Speaker
About the size of the data and had you sort of foreseen what was sort of coming that we're seeing now?
00:19:12
Speaker
No. There was no notion whatsoever. It was simply a way that we thought of as being a very good way to analyze the data. It's as simple as that. And in some cases, breaking up the data is obvious. Maybe I should mention this right here and now. When you do the conditioning, sometimes it's, well, let's just take the example. Let's suppose you have 10,000 banks around the world and you have financial data.
00:19:40
Speaker
you know, a lot of financial variables from each bank. And so your data analyst, you say, well, how should I start looking at these data? Well, you're going to break it up according to bank and look at all the banks individually and make plots of the financial variables for each bank and then take a look at it. That's just a natural thing to do.
00:20:01
Speaker
And people were doing that. And that's going to suggest that somehow we invented conditioning on variables. That's just not the case. We exploited it because we developed ways to condition on any kind of variable, not just categorical things. But it was already in play as a best practice for analyzing data. But now let's jump to big data.
00:20:23
Speaker
Because you're doing a lot of work with big data now, both in the tool that you're working on, Tessera, and also a lot of security and cybersecurity work, and I'm sure lots of other projects you have in the hopper as well, correct?
00:20:35
Speaker
Yes, that's exactly right. So I don't think there's been a period in my career when I haven't done data analysis. Again, this is this notion that if you're going to do research in data analysis, it's a good idea to do it yourself. Actually, you get a lot of ideas, too. It's the best heat engine for ideas for research in data analysis is to actually do data analysis. You go, hey, I'm going to try this. And then you say, you know what? I think I can generalize this to other cases.

Divide and Recombine Method

00:21:02
Speaker
So with big data, we have an approach.
00:21:06
Speaker
to big data that we call divide and recombine. And when I tell you what it is, you're going to say, whoa, gee whiz, that sounds like Trellis display. It's exactly what we do. We break up the data into subsets by conditioning. Well, actually, there's different ways to do the subsetting. I'll say a word or two about that. But you condition on variables important to the analysis and then do analyses. Now we're doing it not just because it's a good best practice for data analysis, but for the computing.
00:21:36
Speaker
OK. So now enters into it parallel distributed systems. So you take a 10 terabyte data set, and you break it up. And not many people have 10 terabytes of memory. And so you start analyzing it by breaking it into subsets. And one of the major methods that we have is this trellis display-like thing, where you're breaking it up according to variables important to the analysis. We call that conditioning variable division.
00:22:05
Speaker
And that's a very major way that we ourselves operate. So the idea when it's this conditioning variable division, like analyze data from all the banks, what then we do is to have a recombination method that takes that information in one way or another, combines that information to provide you
00:22:25
Speaker
with what it is you want to see in the data. Now, there's also another way of doing the breakup that's somewhat different. Namely, let's suppose you're doing a logistic regression and you've got 20 predictors and there isn't any natural, you don't say, well, there's some conditioning variable here because what I want is in the end, I don't want to see 10,000 subsets the way you would want to do for the banks. What I want is I just need to be able to compute
00:22:55
Speaker
And I want one answer when we're all done. So you break the data into subsets, you do your logistic regression on every single subset, you take the output of that and then you recombine it to find, to produce a single answer. So we call all of this divide and recombine.
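A minimal sketch of that divide-and-recombine flow for logistic regression, in Python. The per-subset fitter here is a deliberately simple gradient-descent routine and the data are synthetic, and averaging the coefficients is the crudest possible recombination; the real recombination methods Cleveland alludes to are more careful about statistical accuracy:

```python
import math
import random

def fit_logistic_1d(data, steps=500, lr=0.5):
    """Toy one-predictor logistic regression fit by gradient descent.
    Stands in for whatever fitting method a real analysis would use."""
    b0 = b1 = 0.0
    n = len(data)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += (p - y) / n
            g1 += (p - y) * x / n
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1

# Synthetic rows: the response tends to 1 when the predictor is positive.
random.seed(0)
data = []
for _ in range(2000):
    x = random.uniform(-3, 3)
    data.append((x, 1 if x + random.gauss(0, 1) > 0 else 0))

# Divide: break the rows into subsets (sampling division)...
subsets = [data[i::4] for i in range(4)]
# ...fit each subset independently (in parallel on a real cluster)...
fits = [fit_logistic_1d(s) for s in subsets]
# ...and recombine into the single answer you wanted all along.
b0 = sum(f[0] for f in fits) / len(fits)
b1 = sum(f[1] for f in fits) / len(fits)
print(f"recombined intercept {b0:.2f}, slope {b1:.2f}")
```

The payoff is that no step ever needs all the rows in memory at once: each fit sees only its own subset, and the recombination sees only the per-subset coefficients.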
00:23:14
Speaker
OK, whether it's conditioning variable division or whether it's this, we call it sampling division. Because you can actually do these divisions just by sampling. If you're doing logistic regression, you break it up into subsets according to the, you know, each subset is the rows of your... Some characteristics of the sample.
00:23:33
Speaker
Yeah, actually there are division methods that can help the statistical accuracy. The idea is to try to make the statistical accuracy from this as close as possible to what you would have gotten had you been able to compute on all the data. Right.
00:23:48
Speaker
So that's a major area of research for us. So we actually do that sampling division. It's very challenging, by the way. It's not so easy, but we're making headway. And the recombination method has to, you have to pay attention to computation. I mean, it's possible to come up with recombination methods that work marvelously, but you've sort of cheated. I mean, you have to do more computation than maybe than what you had to do if you were trying to analyze all the data all at once.
00:24:16
Speaker
You have to keep an eye on the computational complexity and intensity of the recombination method. Right. Now, the team that's working on Tessera and some other projects, it's an open source tool, but who sort of makes up the larger team? At the moment, it's me, some other professors at Purdue, Bowei Xi, for example, other people. It's graduate students of ours.
00:24:43
Speaker
It's people at Pacific Northwest National Laboratory. They're out in Washington. They're one of our national labs like Argonne and Los Alamos and whatnot. There is also Ryan Hafen, who was at PNNL and is now sort of an independent consultant. So that's pretty much our team today. Yeah, a pretty wide team.
00:25:03
Speaker
Yeah, one person that was on the team originally, he's not active on our project at the moment, was Saptarshi Guha, who's at Mozilla. We were actually working on building a cybersecurity algorithm and we had millions of, you know, a couple of million connections, SSH actually, which is, you know,
00:25:26
Speaker
a way of creating logins. You log in with SSH, for those who don't know, from one computer to the next and then have a session. Well, there's also another instance of SSH, which is you just do a file transfer, but it's the same technology underlying it. Anyway, we were doing research trying to find an algorithm that would enable us to detect keystrokes, not what the keystrokes are, but rather that there was a keystroke.
00:25:53
Speaker
We had this algorithm, we had to apply it to millions of things, and it was a nightmare to try to get it to compute. And then one day, Saptarshi comes to me and says, you know what? There's a new thing that just was put out by Yahoo, and they call it Hadoop.
00:26:08
Speaker
It will do parallel distributed computation and I think I can write software that will enable us to connect R with this Hadoop system. And he said, do you think this would be a good thing to do? And I said, please leave my office and go and get this done as fast as you can.
00:26:27
Speaker
So it was Saptarshi, he's the one that discovered this, and it wasn't very well known at the time, I'll tell you. It eventually did become very well known once Yahoo moved it out and other companies started up and started doing Hadoop.
00:26:45
Speaker
Well, clearly an important tool as the data get bigger. I want to ask one last question before I let you go.

Challenges of Big Data

00:26:53
Speaker
I noticed on your website that you make a point of talking about how big data doesn't necessarily mean large data sets or number of observations. Yes, I should have said this earlier. Big data is really a lousy term.
00:27:09
Speaker
to be the key words for the computational challenges we face today. The reason is that there are certainly other factors that create computational challenges besides the size of a set of data. It's obvious, once you start thinking about it, one is the computational complexity of the analytic methods that you're using. That's a very big factor.
00:27:34
Speaker
if you do some very simple procedure on the data. Let's say you break it up into subsets. Forget it. You don't break it up into subsets. You've got a relatively small data set. Maybe it's only 500 megabytes. But if the computational methods that you're using are extremely intensive and have very high computational complexity, that can be a real problem.
00:28:00
Speaker
The other thing is the hardware environment you have for doing computation. I mean, again, that's pretty obvious, right? I mean, the more hardware power you have, the larger dataset, you know, you'll be able to analyze larger datasets.
00:28:17
Speaker
We find ourselves using our divide and recombine methodology for data sets that are only a half a gigabyte. Because the computational methods we apply take a long time. Right. It's a method that's taking up the time and the processing power, not necessarily the number of observations.
00:28:34
Speaker
Exactly. So there's three things. There's the size of the data, the computational complexity, and your hardware power. I mean, it's pretty much true that if you have a cluster and you have five nodes and then you go off and somebody says, yeah, here's some money. You can buy five more nodes. You're going to be able to run things faster. That's all there is to it. Part of it is sheer feasibility.
00:29:01
Speaker
I mean, if you compute serially, then you certainly can't analyze a data set that's bigger than memory. Yeah. Because you can't even fit the data inside. So this divide and recombine, you run the subsets. There's only a certain number running at the same time. And then you're reading the subsets in as the computations finish for the subsets that the computer was computing for you. So that part of it is sequential.
00:29:29
Speaker
So obviously, the more cores you have running this, the faster it's all going to go. So the power does really matter a lot.
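The feasibility point above, that only so many subsets compute at once while the rest wait their turn, can be mimicked on one machine with a bounded worker pool. A hypothetical sketch; a real divide-and-recombine back end would also manage reading subsets off disk so the full data set never has to fit in memory:

```python
from concurrent.futures import ThreadPoolExecutor

def analyze(subset):
    # Stand-in for the real per-subset analytic method; here, just a mean.
    return sum(subset) / len(subset)

def make_subsets(n_subsets, size):
    # Stand-in for subsets read off disk; subset i holds i, i+1, ..., i+size-1.
    for i in range(n_subsets):
        yield [i + j for j in range(size)]

# At most max_workers subsets compute at the same time; raising it
# (i.e., having more cores) makes the whole job finish sooner.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(analyze, make_subsets(100, 1000)))

# Recombine the per-subset results into one overall answer.
overall = sum(results) / len(results)
print(f"overall mean: {overall}")  # prints 549.0
```

On a cluster the pool of workers would be cores spread across nodes rather than threads, but the accounting is the same: total time scales with the number of subsets divided by the number of workers.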

Episode Conclusion and Listener Engagement

00:29:40
Speaker
Well, this has been great, a tour through the data visualization and computational method space. Professor Cleveland, thanks so much for coming on the show. It's been really great chatting with you. OK, thanks, John, for inviting me.
00:29:55
Speaker
And thanks for everyone for tuning in to this, the last episode of the Policy Viz Podcast for the summer. I'll be back in September with a whole new slate of great guests. Probably change things up a little bit. I'll probably go to an every other week format, maybe with some special bonus episodes. Just not sure I can keep up the weekly pace going forward.
00:30:14
Speaker
If you're interested in sponsoring the show and helping me keep the lights on with the sound editing and the website support, please do get in touch and let me know if you or your organization would like to sponsor the show. And finally, please do rate the show on iTunes. Let others know about it through your comments, through your reviews, and your, hopefully, five stars. And if you have comments or suggestions on the show, things you'd like to hear, guests you'd like to hear from, please do let me know on the website or on Twitter or via email.
00:30:40
Speaker
So until next time, until the fall, this has been the Policy Viz Podcast. Thanks so much for listening.