Episode #153: Claus Wilke

The PolicyViz Podcast

Claus O. Wilke is a computational and evolutionary biologist and chair of the Department of Integrative Biology at University of Texas at Austin, where he is the Dwight W. and Blanche Faye Reeder Centennial Fellow in Systematic and Evolutionary Biology....

The post Episode #153: Claus Wilke appeared first on PolicyViz.

Transcript

Introduction and Author Background

00:00:11
Speaker
Hi, everyone. Welcome back to the PolicyViz Podcast. I'm your host, Jon Schwabish. On this week's episode, I get to sit down and chat with Claus Wilke, whose new book, Fundamentals of Data Visualization, has just been published by O'Reilly.

Open Source Trend in Publishing

00:00:24
Speaker
It's a great review of the fundamentals of Data Viz, mostly built in R. It has some great images in it, really well put together. I'm really enjoying making my way through it.
00:00:38
Speaker
Of course, I've been looking at the online version for a while. So one of the new things that I think we're all seeing more and more is people making their books more open source and putting out review copies before they actually get onto the bookshelf. So it's been fun seeing Claus develop this book over time.

Journey to Data Visualization

00:00:57
Speaker
It's also fun to chat with him about his background in biology and the data visualizations that he's been working with and making and reading
00:01:05
Speaker
in that area. It's always interesting for me to see how people approach data visualization and working with data and communicating data in different kinds of fields. So I hope you'll enjoy this week's episode. Here's my interview with Claus. So Claus, welcome to the show. Thanks for coming on. I appreciate it. Thanks for having me.
00:01:27
Speaker
Congratulations on the new book. You must be relieved. I am. I am very relieved. It's out. Are you taking a long break now?
00:01:35
Speaker
No, I mean, it's just that so much work got put on the back burner while I was writing the book that now I just have to do all that work. No rest, right?

Academic and Professional Focus

00:01:48
Speaker
No rest. Do you want to talk a little bit about yourself and your background and give folks a sense of how you got interested in data visualization and ultimately how you came to write this book?
00:01:59
Speaker
Yeah, sure. So I'm a professor of integrative biology at the University of Texas at Austin. I consider myself a computational biologist. I do a lot of data science, broadly speaking. I'm also quite involved lately in the R community. Originally, I'm actually a theoretical physicist. So I did my PhD in theoretical physics, and then I transitioned

The Art of Data Visualization

00:02:22
Speaker
over to biology. And so I have some amount of physics background, I have a good understanding of biology, and I also have a fair amount of computing and data science background. And I've really always been interested in visualization. It's just something I like. I like to make things look nice and to look at
00:02:44
Speaker
visualizations. And so it's just something I've cared about a lot over the years. And if you talk to my students, they'll tell you how picky I am when they show their figures. I'm like, ah, that font is too small. And here, there are two lines that are not quite in alignment. So somehow, I also just pick up on subtle visual cues that other people might not be so sensitive to.
00:03:10
Speaker
Are you teaching data visualization classes in addition to the biology classes?

Teaching Data Science

00:03:16
Speaker
I'm not. I might at some point in the future, but I'm currently not. The main class that I'm teaching is a Data Science for Biologists class. It's kind of meant as the next class that they take after they've taken biostats, and it just
00:03:34
Speaker
moves them a bit more towards doing practical data science. And as part of that, they learn how to make graphs and to interpret them. And then we also do some machine learning and some bioinformatics and things like that. But there isn't a dedicated visualization class that I'm teaching.
00:03:49
Speaker
We'll talk about the book, but it's just interesting to me to think about how you teach biologists these other skill sets. So do you think having a separate DataViz class is necessary for those students or having it combined with these other data science skills the way you're teaching it now?

Skills Essential for Scientists

00:04:07
Speaker
Is that the right way to do it or is having them separated out, you would think a better way to do it?
00:04:12
Speaker
So I think what I'm teaching now is an essential class that really almost everybody should take. Everybody should have some basic knowledge in data science, data wrangling. I teach some R, tidyverse and some Python and we just like learn how to take a large data set and get patterns out and so on. And that I think everybody should be familiar with. And that's actually taught to undergrads and the idea is to
00:04:39
Speaker
get them to learn this relatively quickly in their curriculum. I think a data vis class would be more of an advanced class, maybe primarily for graduate students or really dedicated undergrads. So I think it would be much more of a specialized class versus general data science. I feel that, certainly in the natural sciences, every student should have some basic data science skills.

R vs Python for Visualization

00:05:04
Speaker
So you mentioned that you're teaching R and Python, and I think in the book, you did the graphs in R as well, right? That's correct, yeah. And so is R your go-to? Like, what are your favorite data vis tools?
00:05:17
Speaker
So yeah, for data visualization, I exclusively use R. Actually, the older I get, the more I use R and the less I use Python. So it's kind of an interesting transition: in the past I did a lot of Python, and now, in my own work, I use R almost exclusively.
00:05:34
Speaker
They just have slightly different application areas. I feel Python is a good general purpose programming language. If you just want to write a game or you want to build an interactive web page where people enter information and get stuff back or so, Python is great for that. But for pure data science work, I just feel that R in many ways is more convenient.
00:05:57
Speaker
Do you find that your students struggle with learning R? I mean, they're undergraduates. So are they coming to it? This is the first language that they're learning. So it's just the uphill climb. They're not trying to think around other languages that they may have already learned.
00:06:10
Speaker
It's very mixed, so actually the background in the class is incredibly mixed. Some people have done tons of Python and have never touched R. Others actually know already some R. Our biostatistics class also uses R, so they have used it a little bit.
00:06:28
Speaker
I personally feel that, in particular with the tidyverse, we can do a lot of interesting stuff without really having to think about programming. We get halfway through the class before we ever do a loop or ever do an if statement because we just do things like... I don't know how familiar you are with the tidyverse, but in particular, dplyr, you're like...
00:06:51
Speaker
filter to pick rows and select to pick columns, and then you group and you summarize. And so you can do a lot without ever really thinking about what I call the logistics of the data. Now, if you have a for loop with an index variable and you look at the first number in your vector, the second number in the vector, then you think about the logistics, right? Right. But if you write a filter statement, give me all the numbers that are bigger than 10, then you only think about the logic.
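Editor's sketch: the contrast between "logistics" and "logic" described here can be illustrated in plain Python. This is only an assumed analogy; the conversation is about dplyr's filter in R, and the example values and variable names below are made up.

```python
# Hypothetical example values, not data from the episode.
values = [4, 12, 7, 25, 10, 18]

# "Logistics" style: an index variable walks the list position by position.
bigger_loop = []
i = 0
while i < len(values):
    if values[i] > 10:
        bigger_loop.append(values[i])
    i += 1

# "Logic" style: state the condition and let the language handle the iteration.
bigger_filter = [v for v in values if v > 10]

print(bigger_loop)
print(bigger_filter)
```

Both versions select the same numbers; the second one simply never mentions positions in the list, which is the point being made about filter statements.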
00:07:17
Speaker
And so I found that when teaching starting with the tidyverse, I can really focus on the logic without getting bogged down by the logistics, and it actually almost works better if people have absolutely no background whatsoever.
00:07:33
Speaker
If they come already with some preconceived notion of what programming is and how you do data analysis, they find it very difficult to switch that off and to program without using loops, for example. But if they've never programmed before, they don't miss the loop.
00:07:48
Speaker
Right, you're getting the first shot at their experience with programming, right? Yeah. Can you talk a little bit about the book itself?

Evolution of the Book

00:07:54
Speaker
Now, the print version is just coming out, so people can get their hands on that. But the digital version has been online for a while collecting comments. I would guess you've collected a lot of comments. But do you want to walk listeners through the goal of the book and what you hope they'll get out of it?
00:08:10
Speaker
Yeah, so the book really had its origin in me giving the same type of advice over and over to my mostly graduate students in the lab, so I would just find that
00:08:23
Speaker
they would show me their figures and I had the same comments over and over. Maybe the biggest one, which was the first chapter that I wrote, was that everybody makes their axis labels too small. The universal truth of data vis is that the axis labels are too small. And actually, in almost every visualization software, the defaults make
00:08:45
Speaker
the labels too small. And so that's just like, I feel like I'm repeating myself and repeating myself and repeating myself. And so then at some point, actually, I thought about writing such a book for a long time. And I didn't really feel that I had the technology in place to make it sufficiently convenient that I was willing to
00:09:04
Speaker
do it. And R has developed to the point, I mean, the entire book is written in R Markdown, right? All the figures are automatically generated, I can just press a button and the entire book gets rendered just as it's up on the web page.
00:09:19
Speaker
And that technology and that convenience have really been around for only a couple of years. Like, if I had tried to write this book 10 years ago, I would have written it in LaTeX and I would have had to keep track of every figure individually. And that always seemed too much of a headache. I wasn't willing to invest that amount of effort. So on some level, I wrote the book now because I had the tools to write it now.
00:09:43
Speaker
Yeah, and then when you commit to a book, then you also have to write all the chapters that maybe you didn't want to write, but they need to be there. I mean, the book is kind of three parts. The first part just goes through just all sorts of standard ways of visualizing data.

Content and Structure of the Book

00:10:01
Speaker
How do you visualize amounts? How do you visualize associations between variables? How do you visualize proportions? Things like that. Just the standard things: bar plots, scatter plots, line plots, and so on.
00:10:12
Speaker
And then the second part is about figure design, and that just goes through various things that one should think about. One big one is, for example, color choices. How do you pick colors that
00:10:27
Speaker
work, and that also work for colorblind people? And axis labels, not making the figures too busy but also not making them not busy enough. And then the last part is various other topics that I felt should be in the book but that didn't have a clear, coherent
00:10:47
Speaker
maybe heading. And that includes things like how do we combine figures into a larger document? Like, how do we tell a story with a figure? I just touch on this very briefly. But also things like how do you save a figure on your computer? What's the right format? A lot of these things that we kind of expect everybody to know, you know, what's the difference between a PDF and a PNG? When do you pick which one?
00:11:11
Speaker
And nobody ever really spells that out, right?

Figure Design and Misconceptions

00:11:15
Speaker
And, like, you have people that read carefully through the specs of these file formats and understand when to use which, but most other people just
00:11:25
Speaker
don't, and I think they mostly pick accidentally. Sometimes it works out and often it doesn't. I've been seeing a lot of people publishing reports or documents where the text is nice and crisp and then all of a sudden there's like a blurry graph in the middle and then the text picks up right after it and it's
00:11:44
Speaker
it's jarring to see this nice crisp text and all of a sudden this bar chart that is clearly pixelated because however they exported it from whatever tool they were using, they weren't using the right image format. So yeah, I think that is a big issue for people.
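Editor's sketch: one way to see why an exported bar chart ends up pixelated is to compute how many pixels per inch a raster image (such as a PNG) actually provides once it is placed on the page. The functions, the 640-pixel image, the 6.5-inch column, and the 300 dpi target below are all illustrative assumptions (300 dpi is a common rule of thumb for print, not a figure from the episode). A vector format such as PDF avoids this calculation because it is resolution independent.

```python
import math


def effective_dpi(pixel_width: int, placed_width_inches: float) -> float:
    """Pixels available per inch once a raster image is placed at a given width."""
    return pixel_width / placed_width_inches


def pixels_needed(placed_width_inches: float, target_dpi: int) -> int:
    """Minimum pixel width a raster image needs to reach target_dpi at that size."""
    return math.ceil(placed_width_inches * target_dpi)


# Hypothetical numbers: a 640-pixel-wide PNG stretched across a 6.5-inch text column.
print(round(effective_dpi(640, 6.5)))  # about 98 pixels per inch
print(pixels_needed(6.5, 300))         # 1950 pixels wide for an assumed 300 dpi target
```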
00:12:01
Speaker
Yeah, and like JPEG artifacts. All these little artifacts because people don't understand JPEGs. Or on the flip side, they understand that PDF is essentially a resolution independent format and tends to give you the best results. And so they plaster all the PDFs into a Microsoft Word document, and then the document becomes completely unresponsive.
00:12:26
Speaker
I mean, you could print it, but online or on the screen, trying to edit, it doesn't work. So, yeah, there are all these little tricks that everybody should know, but there are just too many of them and most people don't. And so I hope my book can fill some of those gaps.
00:12:41
Speaker
Yeah, that's great. I mean, the other thing about the book that I think is not really out there elsewhere is that you explore not just the standard line, bar, pie, you know, area charts. There are other chart types out there, and you spend some more time talking about those other graphs. So I'm curious about your take on the standard charts, however you define standard charts, but, you know, if you think about lines and bars and pies, versus non-standard chart types, which,
00:13:10
Speaker
you know, might just include slope charts and dot plots as a starter. You know, they're not the sort of core things, but they are sometimes actually better ways to show data. And I'm curious how you balance the two when you're thinking about teaching people data vis or when you're thinking about what you see online in terms of people choosing different chart types. I think we should be adventurous, but we should also be critical.
00:13:33
Speaker
Right? So if I can show some data set in a way that maybe is not so standard, but really brings out a key aspect of the data, then we should totally go for that. At the same time, if you go to this Xenographics web page, I mean, some of the ideas that some people try out really maybe
00:13:58
Speaker
didn't quite work, but it's okay. I mean, we try it out and then either it works or it doesn't. There's a couple of things that I care a lot about. One is, if you write a report with, say, five or six figures, I think it's actually really important that every figure looks different visually. Like, if you have 20 pie charts, then after a while they all blend together and the audience really, I mean, oh, another pie chart.
00:14:26
Speaker
And you might be talking about a totally different topic now, but it's just all pie charts. And I should maybe not talk about pie charts because people have strong opinions on that. So let's talk about scatter plots, right? Everybody is on board. Scatter plots are a good idea when you have association data, right? But if I show you 10 scatter plots in a row after a while, your mind shuts off and they all look the same.
00:14:54
Speaker
Right, and so I think it's actually really important to have a repertoire of different possibilities of showing data so that we can just keep it changing, so that the audience is like, okay, the scatter plot was this part of the story, and now we have, I don't know, a density plot.
00:15:12
Speaker
And now we're talking about something else, right? So just like we can use color and we can use fonts and so on to make clear that we're now talking about something else, we can also switch up the type of visualization that we use to structure a document. I care about that a lot.
00:15:32
Speaker
Because definitely I've sat in PhD committees where every graph is a line that goes down. I can't keep them apart.
00:15:44
Speaker
And they're all different things,

Role of Reviewers in Visual Clarity

00:15:46
Speaker
but the graphs look all the same. Yeah. Well, what about, I mean, I think report construction is an interesting topic in itself. I mean, there are lots of journals out there, at least in social science, where they require you to put the figures at the very end of the paper as an appendix, which always bothers me because I want them to be integrated with the rest of the report. I want it to be an argument. And the visuals are supposed to support the argument. And by putting them at the back, you sort of relegate them to this secondary status.
00:16:13
Speaker
Yeah, but do they print it like that, or is it only when you submit? Some of them put them in the back, some of them move it later on. I guess it just differs by journal. Yeah, I know people care about that a lot. I guess I'm okay. If the figures are all at the end, I can find them there. What I care more about is that the caption needs to be with the figure.
00:16:34
Speaker
The worst format is you have a page with all the figure captions, and then afterwards you just have all the figures separately, and then you have no idea which caption goes with which figure. Right. Like figure one here, then the title or the caption, but then when you get down to the actual figure at the end of the paper, it doesn't have that same text. Yeah. It's certainly easier to lay things out that way, but I think it's a real disservice to the reader.
00:17:02
Speaker
Yeah, I mean, so this is a completely different topic, right? But I find scientific journals in their submission instructions, they mix guidelines for how the reviewer would want to look at it with how final production happens, right? They make you submit the figure separately as separate image files. They're really thinking about the final production of the paper rather than the initial review stage.
00:17:28
Speaker
And I think we should just be allowed to submit a PDF with all the figures embedded, however we want, wherever they fit correctly into the document flow. And once everybody agrees that this is an appropriate publication and we want to publish it, then we can worry about production. And I wish this was more separated, but certainly many scientific journals kind of muddle the two together and that's where the problems come from. Now, I'm guessing you're like me when you're reviewing a paper for a journal, you are commenting on the visuals as well.
00:17:58
Speaker
Oh, yeah. I mean, I was arguing with someone about this a couple of days ago. Now we're totally off the topic of your book, but this is really interesting. So is there a responsibility from the journals to encourage their reviewers to review the visuals as well?
00:18:20
Speaker
So responsibility is a strong word. I feel like ultimately the responsibility is always with the author; that's my own opinion. Like, the purpose of review is to improve the product. It's not necessarily to verify that the product is correct.
00:18:39
Speaker
Certainly in biology, well, nobody's going to spend a million dollars and three years of trying to actually validate that everything is correct. We have to on some level accept it as it is, but then we can look at it. We kind of do a basic sanity check if something sounds totally outrageous.
00:18:57
Speaker
We point that out. Most of the time, though, certainly when I review, I try to help the authors improve the paper, and if they have visualizations that clearly are not going to work for the audience, it's actually in the authors' interest that I point that out.
00:19:14
Speaker
In the end, I strongly believe that people have the right to embarrass themselves, right? So if I say this is really a bad visualization and they say, no, we like it, we want to have it that way. Okay. Well, it's your choice. But at least I said it. Right. Right. You feel like you've done your job as a reviewer. Yeah, exactly. Unless it's clearly wrong.
00:19:40
Speaker
I mean, if things are just objectively wrong. Well, if it's objectively wrong, then you point it out. But if it's more, well, you really should consider using larger labels in all your figures because nobody can read this, and the author insists that they want to have figures nobody can read, I mean, in the end, it's their choice. Yeah. Have you ever written your review back to the editor and said, look,
00:20:06
Speaker
My comments to the authors were such and such, but really the graphs are so bad that this would really need to be resubmitted as an entirely new thing. The authors really need to rethink the way that they're presenting this because the graphs are just so horrendous. I have done that, yeah. But then again, in the end, if the authors insist, I would say, okay, it's your choice.
00:20:29
Speaker
It's amazing to me that, well, maybe it's not amazing to me still, but I want to say it's amazing to me. There's not a lot of thought given to the reader, even of academic journals. So even if you're thinking about someone, you know, researchers thinking about communicating to other researchers, they still don't think about the audience.
00:20:46
Speaker
and the text is very dense and the graphs are really hard to read. And then you see these in publication. And I feel like there should be someone at some point who should be, you know, there's the editor, there's the peer reviewers, there's the, you know, the desk editor, then the editor that might be actually laying things out that someone's got to say, look, this is really hard to understand.
00:21:05
Speaker
Yeah, I mean, I think it's the reviewers, really, that should do that. And actually, also, when I work as editor, most of the reviews that I get are along those lines. Well, this really, I couldn't quite understand this, the authors, I encourage the authors to improve. I personally feel it's a wasted opportunity, right? If you write a dense paper that nobody can understand, then in the end, nobody's going to hear your message.
00:21:31
Speaker
And so when I tell an author to maybe reconsider how they're presenting their work, I'm really trying to help them to get the message across better. But some authors listen and others don't.
00:21:46
Speaker
Now, I'm curious about your experience in the biology literature, because I spend, obviously, my time in the social science literature. And I dabbled in just perusing some of the biology literature. And the graphs were like, I mean, I'm sure for biologists, they're second nature. But for me, there are dendrograms and things showing gene breakdowns. And they look completely foreign to me. So I'm curious about your experience in the biology literature and the type of graphs that folks use to present their research.

Critique of Complex Visualizations

00:22:15
Speaker
Yeah, so graphs in biology can be wild. So the one thing that I think biology does well, but it might still be unusual when you come into it, is biologists are very good at drawing diagrams or schematics. So you have a gene and you have a regulator and you have a promoter and so on.
00:22:35
Speaker
pathway, those kinds of diagrams where they just use the same visual language over and over. And so every biologist looks at it and says, oh, there's a gene and there's an enzyme and oh, there's a connection here. And if you have never seen those, it would just be boxes and arrows and you wouldn't know what is going on. So that, actually, I think biologists do very well, and other fields maybe could do more of that also.
00:22:59
Speaker
Then there's this other part, and that's mostly in computational biology and, like, high-throughput systems biology and so on. They use incredibly dense and complicated visualizations where, honestly, I feel like with the typical Nature paper these days, you open it up and there are beautiful colors and it could work as modern art on your wall, and
00:23:24
Speaker
I am convinced no reader actually understands what's going on. And the problem is not only are the visualizations incredibly complex, they also tend to be of incredibly derived quantities.
00:23:38
Speaker
You do some complicated measurement of millions of values, and then you calculate some sort of summary statistic of subsets of the data, and then you take those and you pool them again and average, and then you integrate or whatever, and in the end you still have a million numbers, but they were sent through a pipeline of 10 computing steps and really nobody
00:24:01
Speaker
can understand what it is. So I'm very critical of that because I feel there's a lot of, it looks cool and people kind of think they should like it because clearly it was a lot of work, but it's not clear that these figures actually convey that much insight. Yeah. The beauty and the complexity part might be valuable in a different context, but in the journal article world, do you want to make that argument?
00:24:29
Speaker
In the end, there should be an insight, right? After we have spent, I don't know, $500,000 on experiments and graduate student time working for three years and making millions of measurements,
00:24:44
Speaker
It would be good if there was a clear insight at the end and not just, oh, here's stuff. So now we can come full circle and come back to the book then.

Conveying Insights with Data

00:24:54
Speaker
So that's the primary goal of the book, is to help people use Data Vis to provide insights to their readers or their users. If that worked out, that would be great.
00:25:08
Speaker
I mean, I kind of touch on all of these things. Some of them maybe are only a short section. So one thing I really care about in writing reports is, I feel you always should start from data that is the closest to raw data. And then as you go along, you can work with more and more processed data until you have some very derived quantity at the end. Right. So like,
00:25:34
Speaker
you measure some quantitative variables and you start with a scatter plot, and then you can turn that into, maybe you would do a regression, you have a correlation. And then if you have a lot of correlations, you can visualize them as a heat map, and then maybe you can summarize heat maps into a pie chart where, like, some grouping is this way, some grouping is that way. So that would be a sequence where you start out at something that is very close to a number that you can imagine that was measured
00:26:02
Speaker
At the end, you have some highly derived quantity. And it's really important to have this sequence. If you go backwards, you immediately lose everybody. And if you just start at the end and never show the less derived parts of the analysis, then also everybody's lost. That's somewhere in the book. It's only a few paragraphs, but it's in there.
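Editor's sketch: the raw-to-derived sequence described here can be illustrated with a few lines of standard-library Python. The variable names and measurements are hypothetical, and the Pearson correlation is only one assumed choice of summary; the episode does not specify a particular statistic or tool.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical measurements of three variables on four samples.
measurements = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
}

# Step 1: the raw pairs, which is what a scatter plot would show.
raw_pairs = list(zip(measurements["a"], measurements["b"]))

# Step 2: one derived number summarizing that scatter plot.
r_ab = pearson(measurements["a"], measurements["b"])

# Step 3: all pairwise correlations, which is what a heat map would show.
names = list(measurements)
corr_matrix = [
    [pearson(measurements[p], measurements[q]) for q in names] for p in names
]

print(raw_pairs)
print(r_ab)
for name, row in zip(names, corr_matrix):
    print(name, [round(value, 2) for value in row])
```

Each step discards detail from the one before it, which is why presenting the steps in this order lets a reader connect the final summary back to numbers that were actually measured.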
00:26:29
Speaker
Right. Well, I'm sure people will check it out. So there's the online version. And then there's the print version that's just coming out. So good luck with it. I'm sure you're at least relieved that it's done. And congrats again on getting it out there. Thanks for having me. Yeah, thanks for coming on the show. This was a lot of fun. We veered off a little bit, but this was a lot of fun. All right. Thanks, Claus. I appreciate it.

Conclusion and Call to Action

00:26:55
Speaker
Thanks everyone for tuning into this week's episode. I hope you enjoyed it and I hope you will check out Claus's book, Fundamentals of Data Visualization. Also, if you're interested in supporting the show, please consider leaving a review on your favorite podcast provider or putting a couple bucks a month towards the show on the Patreon account, where you can help support the show. Help me cover costs of editing and transcription services and all the things that
00:27:21
Speaker
I need to bring the show your way. So I hope you enjoyed this week's episode. Until next time, this has been the PolicyViz Podcast. Thanks so much for listening.