
#3 David Martens: Data Science Ethics

AI and Technology Ethics Podcast

David Martens is a professor of data science in the department of engineering management at the University of Antwerp, and he is the author of Data Science Ethics: Concepts, Techniques, and Cautionary Tales.

Some of the topics we discuss are the relationship between data science and artificial intelligence, ethical concerns during the data collection process, the European law known as the General Data Protection Regulation, the problem of re-identification of individuals if data that’s made public isn’t properly anonymized, and the dangers of launching powerful AI models for use by the wider public without any oversight—among many other topics. I hope you enjoy the conversation as much as I did.

Transcript

Introduction to AI Ethics with David Martens

00:00:16
Speaker
Hello, and welcome to the AI and Technology Ethics Podcast. This is Roberto, and today I will be in conversation with David Martens. David Martens is a professor of data science in the Department of Engineering Management at the University of Antwerp, and he is the author of Data Science Ethics: Concepts, Techniques, and Cautionary Tales. Some of the topics we discuss are the relationship between data science and artificial intelligence,
00:00:45
Speaker
ethical concerns during the data collection process, the European law known as the General Data Protection Regulation, the problem of re-identification of individuals if data that's made public isn't properly anonymized, and the dangers of launching powerful AI models for use by the wider public without any oversight, among, of course, many other topics. I hope you enjoy the conversation as much as I did.

Basics of Data Science and AI Models

00:01:27
Speaker
Okay, let's kick off with a bit of a basic question. For context, I imagine some American listeners will know what an AI model is or what an AI algorithm is, but they might not know what a data science model is. Perhaps it's just a different nomenclature. So very basic questions to get us going here. What is data science? What are data science models? When did this technology come about? And what can it do for us?
00:01:57
Speaker
Right. Okay, so most of what we call AI these days is very similar to what we previously called data science. And so it's mostly about learning from data, typically to make predictions about something. So, for example, think of how all banks are using data science or AI to predict whether you're creditworthy or not.
00:02:18
Speaker
So they use data that they have on all the customers that ever took out a loan. And they look for patterns: how do the good customers differ from the bad customers? How do they differ in their income, in their profession, and so on? And so computers can find these patterns to then make predictions about people applying for a loan. And so before, we called it data science; honestly, people are calling these AI models these days. The naming has changed, but under the hood it's very often the same thing.
00:02:47
Speaker
Yeah, I feel like sometimes things like deep learning, the feeling of newness just comes from good branding.
00:02:54
Speaker
Yeah, so deep learning is maybe an innovation of the last decade or two, which is quite new in the sense that it's really large neural networks. But neural networks are not new at all; they've been around since the 50s. The innovation was building these very large neural networks, which are great when you work with images, for example, or with text, like ChatGPT. But most of the AI models that we use in daily life, like those at banks, or fraud detection, or spam filters,
00:03:24
Speaker
These are all just typical data science models that have been out there for a long, long time. Wonderful. Well, I love how you break it down in the book. You basically go over the different stages of the data science process. Let's move on to that. What are the different stages of the data science process?
00:03:47
Speaker
Yeah, so there are many different setups that we can use. In the setup that I use in the book, I start with the data gathering. So it always starts with where the data is coming from.
00:04:00
Speaker
Then there's the data pre-processing, or preparing the data: what do you want to do with the data, what do you want to predict with it? Next we have the modeling itself, where we're going to find the patterns in the data. Then we need to evaluate it: how good are these patterns, what did you find, how good is this AI model if I'm going to use it?
00:04:19
Speaker
And then finally, if I'm going to deploy it, what do I need to consider at that time? So it's really the different stages as you go through a typical data science or AI project, going from getting your data all the way to, OK, now I'm going to deploy it in the field.
00:04:34
Speaker
Great. We'll dive into these stages in a second. Before we do that, I did want to discuss, and I'm not sure if I'm pronouncing it correctly, is it the "fat flow" framework or the "F-A-T flow" framework? Yeah, I call it the FAT flow framework.
00:04:54
Speaker
Yeah. Okay. Let's talk about that for a second.

Ethical Frameworks in AI: FAT Flow

00:04:58
Speaker
You're using the FAT flow framework, where FAT stands for fairness, accountability, and transparency. Right. Tell us a little bit about what each of these means.
00:05:08
Speaker
Yeah, so these three criteria have actually been out there in the community for a bit. There was this FAT conference, which is really well known in our field of data science ethics. And I feel that these criteria cover most of the ethical issues surrounding AI really well.
00:05:28
Speaker
So fairness, for example, it's treating people equally. And so it's mostly about fairness, non-discrimination. So these AI models make decisions about all of us. For example, do I get credit or not? Do I get accepted to university or not?
00:05:43
Speaker
And often these models work differently for different groups. So it might be that it works much better for men than for women. And so Fairness really says this model should work as well for men as for women. And it's very often simply not the case if you don't actually look at it and see what's happening under the hood.
00:06:03
Speaker
And so next to that, there's also privacy. So treating people equally according to their privacy rights. So often it's very personal sensitive data. It's a topic that's covered in depth here in Europe. We're all very well aware of it. Then transparency.
00:06:20
Speaker
That's all about, well, how transparent are you about your process, going from where your data is coming from all the way down to deployment: who gets access to your AI model? If we talk about ChatGPT, for example, these days, what data did it use to build that model? Much of it copyrighted data; you see the New York Times lawsuit, for example, and that's all about the data gathering. If you go all the way up to deployment, there it's about who gets access to ChatGPT.
00:06:48
Speaker
Should minors be using ChatGPT? Should mentally unwell, unbalanced people have access to ChatGPT, and so on? So in all of these stages, you have to be transparent about what's happening, who gets access, and so on. And under transparency, you also have explainable AI, and that's basically explaining the decisions that these AI models are making. So if you say, Roberto, you don't get credit,
00:07:14
Speaker
But we do not give an explanation. It's simply this very well-performing deep learning model says we shouldn't give you credit. Of course, that's unethical. That's not right that you don't get an explanation because this black box model says so. And so that in itself is a whole field explaining why this machine is making some decisions that are really impactful for most of us.
00:07:34
Speaker
And then finally, the A is for accountability. And that's basically going from these nice theoretical policies and principles to real practice. Can you show that you've put some effective measures into place, that you've moved from theory to practice? And so that's, of course, very important. Otherwise, you might just have these principles on your website, but nothing really happens in daily practice.
00:08:00
Speaker
Right, right. Okay, so intuitively, it seems desirable that models are fair, that the people who wield them be held accountable,
00:08:08
Speaker
that they be transparent, et cetera. But there's another appeal to ethical data science that you argue for in the book. You say something to the effect of: ethical data science just is quality data science. That is, ethical practices actually improve the data, right? They improve the models and all that. So can you just tell us a little bit about that?

Enhancing Model Quality through Ethics

00:08:34
Speaker
Yeah, absolutely. So if you have an AI model, for example, that's not performing well on women, for whatever reason, for example because you don't have enough data on women and their credit history, and that's simply revealed, that's really useful. You can start looking for more data on women who got credit to improve your model. And so it really improves your model. It improves your data in the sense that you get the right data, data that's more representative of what you're actually doing.
00:09:03
Speaker
If you can explain why you didn't get credit, and the explanation is, well, because Roberto wears a hat, then you say, well, that's totally not right, that's the wrong pattern that has been found. So then that's revealed, and you can update your model, or you can improve your data.
00:09:24
Speaker
And so if you don't have this kind of explanation, if you don't look at fairness, these things simply do not happen. And so the book is full of these kinds of cautionary tales where you see that things go wrong. If you don't explicitly embrace these things, explicitly ask, is the model fair, do you consider privacy, can you explain why this seemingly good model is making these predictions, you can see all kinds of things go wrong.
00:10:02
Speaker
Let's dive into the data science stages now, and of course, we'll begin with the data collection phase.

Importance of Data Privacy and GDPR

00:10:09
Speaker
Okay, so it goes without saying that we create tons of data as we work our way through the internet, as we use it, social media platforms, all that, and it is being collected, it's being sold, and it's being used for various purposes. Let's start here with a question that you can explore in whatever way you find appropriate.
00:10:30
Speaker
Let's just say that I'm not doing anything illegal, and for the record, for the listeners, I'm not doing anything illegal. If I'm not doing anything illegal, why should I care that my data is being collected and sold?
00:10:49
Speaker
Well, people often say these kinds of things, like, why should I care? I don't have anything to hide. So that's a very common argument. But of course, we have something to hide. We all have something to hide. For example, how much we make, what we look like, the report of our last doctor visit.
00:11:11
Speaker
Perhaps our grades. There are many, many things that are simply sensitive. Your sexual orientation, at some stage in your life, you might be unwilling to show this to everybody in the world. And even if you're one of these very, very few people who say, no, all of these things can be made public, I'm totally fine.
00:11:31
Speaker
Even then, a reasonable person would agree that most people would not be fine with that. So it's very important that we're aware of this kind of personal, sensitive data. It really matters that we take care of it.
00:11:46
Speaker
Agreed, yeah, agreed. On that topic, you discussed the General Data Protection Regulation, which is a European law. Can you tell us about the GDPR? What is it? And do you think we should embrace this type of regulation here in the States and beyond that?
00:12:06
Speaker
When GDPR came out, I was kind of on the fence: is this now a great thing, or is this really some bureaucratic, cumbersome thing? So let me start with why it's perhaps bureaucratic. Every time we do some data project, we have to make sure that every piece of data we use
00:12:27
Speaker
was obtained legally, and most of the time that means with informed consent; we have to fill out some forms. Every company needs to have a data protection officer, so it comes with real overhead.
00:12:41
Speaker
But after a couple of years, I'm really convinced that this is something that's needed. Since GDPR, things really changed in the sense that we as Europeans basically don't have to worry about privacy anymore. It's really covered by GDPR. All these things that I said, there's personal sensitive data that might be traded about us, that might be revealed to the public about us, that's really covered under GDPR. So I think it's a wonderful thing.
00:13:07
Speaker
In the book, I briefly talk about the GDPR as a kind of regulation to cover privacy, but not in much detail. I actually focus mainly on one of its articles, which states how you should treat data. And I think that for anyone working with data, it has these six principles, and it's just
00:13:32
Speaker
really powerful to just look at these six principles. It says you should only get data when it's lawful: don't use data that's stolen or that you shouldn't have access to. Like with ChatGPT and copyrighted data, if you know you shouldn't be getting it, just don't use it, even if it's really valuable.
00:13:51
Speaker
Only use it for the purpose that you initially had in mind. So if you're using data as a bank to do credit scoring, to decide whether to give a loan, yes or no, don't suddenly use it to sell ads, and don't suddenly sell it to a data broker. Keep it to the initial purpose for which you had consent from the user, or for which you obtained it lawfully.
00:14:13
Speaker
And then just how you should treat data: don't keep more data than is necessary, make sure it's accurate, don't keep it for longer than it's needed. And then lastly, make sure that it's stored correctly, so integrity and confidentiality. And these are just
00:14:28
Speaker
common sense, of course, if you work with data, but these principles are really nice to check against for yourself: if I go through a data science project, do I really cover these six principles? So even if you don't need to comply with it, it still holds very interesting, let's say, paradigms on how you should treat data.
00:14:51
Speaker
Great, great. Well, on the topic of the GDPR, one of the key goals of this law is protecting the vital interests of data subjects and their sensitive data, via, say, some sort of encryption for safe storage and other methods, of course. Basically, the law wants to ensure that data stays private and that it's not stored longer than it needs to be, since that's just an added vulnerability, things like that.
00:15:20
Speaker
But some governments have pushed back on that. These governments have requested from tech companies that they provide what they're calling backdoors. So let's begin with this. Why would governments want backdoors?
00:15:38
Speaker
Yeah,

Government Backdoors and Privacy Challenges

00:15:38
Speaker
absolutely. So if you think about it, privacy is not absolute anywhere. If a judge signs off on a warrant, they can come into your house and search all your files and mess up everything that you have and read everything that's in your house. They can go to the bank and read how much you make, and so on. And so we all agree privacy is not absolute in any case. So the argument is, well, it shouldn't be absolute for digital data either.
00:16:06
Speaker
So privacy is not absolute: you might be encrypting your data, but we as a government, if we have a warrant from a judge, should be able to read that encrypted data on your iPhone, for example. Even if the general public is not able to read it, we as a government sometimes just really need to have access to it. For example, if it's
00:16:25
Speaker
a suspected terrorist and we're afraid there might be a terrorist attack, or a missing child and we think there might be information on this phone about where the child is. I think most of us might accept the argument: yes, our privacy is not absolute; let's give the government a backdoor so that they can actually read that information.
00:16:44
Speaker
Right, and you discuss in your book historical cases that sort of create the perceived need for greater breaches of privacy. I think you mentioned a couple who went on a shooting spree and then the federal government wanted to access their phones to get information on that, on the event. So in this context, we have the GDPR going in one direction for greater protection of the sensitive information of data subjects.
00:17:12
Speaker
and the government pushing back in the other direction. Tell us about some of the arguments against the backdoors and why we should side more with privacy. Yeah. Yeah, so indeed, in the book I detail the story of San Bernardino; in 2016, I think, it happened,
00:17:33
Speaker
and where these two terrorists were shot down basically afterwards. And they found an iPhone that was encrypted and they couldn't get into the iPhone. And they thought there would be more information about other potential attacks.
00:17:47
Speaker
And so they asked Apple, well, can you write some software so that we're able to unlock it? Because if you try 10 times with the wrong code, it will be locked for good; you can never get access to it anymore. And actually, Apple declined. They said, no, we're not going to do this, even though the use case seems valid. But then Tim Cook said, well,
00:18:12
Speaker
if you would do it once, basically what you would do is build this kind of master key that would open any iPhone. And so he really calls it a master key for millions of locks that could be opened. And even the fact that this master key exists means that you could open all of them, and you can bet that malicious actors and other states will try to get access to that master key.
00:18:37
Speaker
And so then you have this kind of master key that will worsen security for almost everybody. For this one specific case, all of our privacy is diminished, because once somebody else can have access to our personal, sensitive data, we can become the subject of blackmail, or our data just falls into the wrong hands. And so he really says it's a security versus security issue,
00:19:02
Speaker
where the bigger security issue is that you've built this master key, which shouldn't exist at all. Another argument that's really common is security versus freedom: you want to make everybody more secure by being able to capture terrorists or get more information, but at the expense of the freedom of having this personal data in your hands only.
00:19:28
Speaker
I personally feel that the other argument, security versus security, is much more convincing. And the last one, also quite convincing, is mostly the futility of these kinds of backdoors.
00:19:41
Speaker
The Boston Marathon bombers were family members; they just talked to one another. And once terrorists or malicious actors know that the government can read these things, they'll just use something different. They'll use paper, or they'll just talk to each other in person. And by that time, all of our privacy and security will have been diminished. So it's really a lose-lose situation.
00:20:09
Speaker
And so there are many arguments in both directions, but those are the main reasons why people say we shouldn't be doing this. And so Apple didn't do it. And the FBI said they were able to get access to it through some other third party; they didn't disclose exactly who the third party was. And they didn't find any other information about terrorist attacks. So you see, it was a futile endeavor in any case.
00:20:38
Speaker
Okay, so before moving on to the next stage, I did want to talk about bias in the data sets.

Bias in Data Sets and Models

00:20:46
Speaker
There are various ways in which bias can creep into the data science process. You gave the example in your book of an app that was supposed to track potholes, but the people who had access to the app and who used it more frequently
00:21:01
Speaker
just happened to be younger and in more affluent neighborhoods. So those more affluent, younger neighborhoods more reliably got their potholes fixed. And that's obviously not good. There's also the controversy over Amazon trying to automate their hiring process. And it seems like men were overrepresented in the resumes that they trained their model on.
00:21:25
Speaker
And so the result was that there was a bias against women whenever this model was fed a couple of resumes and the female applicants wouldn't get to the recruiters. Talk to us about how it is that bias creeps into the data science process and tell us what happens when that bias gets into the picture.
00:21:54
Speaker
Yeah, sure. So the examples that you mention immediately indicate how big the issue is. Like with Amazon: with the model that they built, if you had the word "women" on your resume, or the name of a women's college, your resume would automatically be downgraded.
00:22:14
Speaker
I think we all agree that being downgraded just because of your gender is clearly not the right thing to do. And these days we all have this trust in these AI models: yes, the AI must be right. Well, the more you know about AI, the more you realize you should be very careful with these kinds of decisions. And so this kind of bias really creeps in very easily. For example, with Amazon,
00:22:38
Speaker
It seems like in the past, mainly male engineers were recruited by Amazon. So we don't know exactly, but it's something that we can assume. So that's basically whether your data was representative or not. Another thing is how you define your target variable. So what do you define as a good candidate?
00:23:01
Speaker
Is it somebody who stays with the company for two years, or gets a promotion within two years? Well, what about women who get pregnant and take a leave of absence? Women will then, by that definition, look like less good candidates, because in the first years it might take them longer to get a promotion. So if you don't take the differences in gender into account when defining what a good candidate is,
00:23:23
Speaker
the AI model will automatically learn: well, it seems in this data set women are less likely to be good candidates, and because of that, I will not select them. And so there are these kinds of ways in which bias can easily creep in. And it's really important to just measure for it, to be aware of it. There's not always a perfect solution, but at least we should be aware of it and know whether it has happened or not.
00:23:51
Speaker
Okay, we've been on the topic of data collection and we have not exhausted all the interesting issues that you discuss in your book regarding that stage. But in the interest of time, let's move on to the pre-processing stage because this brings up a topic I really wanna talk about, which is re-identification.

Re-identification Risks in Data Processing

00:24:12
Speaker
So yeah, let's work our way in that direction. We're in stage two.
00:24:17
Speaker
We're labeling the data. You write that there already exist methods for ethically storing data, for making datasets public in privacy-preserving ways,
00:24:28
Speaker
and about other issues relating to this stage. You write about k-anonymization, I think it was. Very fascinating, the algorithms; they probably don't make for good radio, but okay. The data is collected, and hopefully it's done in an ethical way. If the pre-processing stage isn't done ethically, that is, without carefully securing data subjects' privacy, data subjects can be re-identified.
00:24:57
Speaker
What is re-identification and why should we be concerned about it? And maybe in your response, you can talk about Netflix's data release. Yeah, sure. So I see it so often that people say, well, we have this data set on people, but we removed the name, the address, and the email, so now the data is anonymous because we removed all personal identifiers.
00:25:25
Speaker
And that's something that Netflix also did. They have all this data of persons watching movies. And they actually made that data publicly available without any names.
00:25:37
Speaker
And so it was just: user one watched these six movies on those dates, user two watched these movies on those dates, and so on. And the reason they did that is that they basically challenged the community, the data science community. They said, whoever is able to improve our recommendation model, so predicting what each user would like to see as their next movie, whoever improves our own model by 10% gets a million dollars.
00:26:05
Speaker
And of course, academics being what we are, everybody jumped on that. But it took many years before anybody was able to improve it by 10%. But the ethical issue really is that many of these people could be re-identified.
00:26:20
Speaker
Meaning that, say, user 123 watched these 10 movies on these dates. And by linking that with publicly available datasets, in this case IMDb, which is a website where you can rate movies yourself.
00:26:35
Speaker
And on that data, on that website, typically you disclose who you are and you say, I watched this movie on that date. And by linking these two, you see, well, but there's only one person who watched exactly those 10 movies on exactly those 10 days. And so there's only one person who's in both of those datasets. And then suddenly you see, well, this person on Netflix also watched
00:27:01
Speaker
several movies that are in the gay and lesbian category, and so suddenly that person's sexual orientation is revealed. Or similarly, you see this person is watching really Catholic movies or Christian movies, and you think, oh, wow, they seem to be Christian; or very Democratic things, like Fahrenheit 9/11. So there are many sensitive categories, like political views or sexual orientation, that can be revealed by re-identifying someone.
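To make the linkage attack described here concrete, below is a minimal sketch in Python. The records, column names, and the pandas-based join are invented for illustration; they are not the actual Netflix or IMDb data.

```python
import pandas as pd

# "Anonymized" release: names removed, but titles and dates remain.
viewing_log = pd.DataFrame({
    "user_id": [123, 123, 123, 456],
    "movie":   ["Movie A", "Movie B", "Movie C", "Movie A"],
    "date":    ["2006-01-03", "2006-01-09", "2006-02-14", "2006-01-03"],
})

# Public ratings site where people post under their real identity.
public_ratings = pd.DataFrame({
    "name":  ["Alice", "Alice", "Alice"],
    "movie": ["Movie A", "Movie B", "Movie C"],
    "date":  ["2006-01-03", "2006-01-09", "2006-02-14"],
})

# Join the two datasets on the quasi-identifiers (movie, date).
linked = viewing_log.merge(public_ratings, on=["movie", "date"])

# A pseudonymous user who matches one public profile on several
# (movie, date) pairs is effectively re-identified.
matches = linked.groupby(["user_id", "name"]).size().reset_index(name="overlap")
print(matches[matches["overlap"] >= 3])   # user 123 <-> Alice
```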
00:27:31
Speaker
And so Netflix actually wanted to do a second prize, to do this again. But there was a lawsuit by someone under the name Jane Doe; we don't know who she is. And she said, I don't want this, because my sexual orientation might be revealed. She was a lesbian, and this was not publicly known yet, not even at her children's school. So, I don't want this to happen, because I'm afraid I will come out of the closet without really intending to.
00:27:59
Speaker
And so the lawsuit was settled, and there was never a second Netflix prize. So you can see that these kinds of things that you wouldn't think about immediately, like, it's just a list of movies that somebody watched, what's the big deal, really need to be thought through deeply to understand what the deep ethical repercussions are.
00:28:18
Speaker
And so in the book, I have many of these stories about re-identification, how important it is based on location data, the web pages that you visit. All of these things really define you quite uniquely and also reveal sensitive things.
00:28:32
Speaker
And so it's full of these stories where you see that even these big tech companies, often they're really not unethical, but they simply haven't thought these things through. And that's why I'm so glad so much emphasis is being put on this now, that at least we go through these exercises: is the data really anonymous, or did you just remove the personal identifiers and can you still re-identify some people? It's a really important distinction to be made.
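The k-anonymization the host mentioned earlier gives one way to formalize that distinction: a table is k-anonymous if every combination of quasi-identifier values is shared by at least k rows. A minimal sketch of such a check, with invented columns and data:

```python
import pandas as pd

def k_anonymity(df: pd.DataFrame, quasi_identifiers: list) -> int:
    """Size of the smallest group of rows sharing the same combination
    of quasi-identifier values (the table's k)."""
    return int(df.groupby(quasi_identifiers).size().min())

# Direct identifiers (name, email) already removed; quasi-identifiers remain.
records = pd.DataFrame({
    "zip":       ["2000", "2000", "2018", "2018", "2018"],
    "age":       [34, 34, 51, 51, 51],
    "gender":    ["F", "F", "M", "M", "M"],
    "diagnosis": ["flu", "asthma", "flu", "flu", "diabetes"],
})

print(k_anonymity(records, ["zip", "age", "gender"]))   # 2: the table is 2-anonymous
# A k of 1 would mean at least one person is unique on zip/age/gender
# and could be singled out by linking with an outside dataset.
```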
00:29:12
Speaker
Okay,

Privacy-preserving Algorithms and Fairness

00:29:12
Speaker
now we're in the modeling stage. So this part of your book is really for the data science practitioners. Let's say we've collected data, we've pre-processed it. It's time to make the model.
00:29:24
Speaker
So I found it very interesting that there already exist various privacy-preserving algorithms and methods. You mention differential privacy and homomorphic encryption and a few others. The one I want to talk about is discrimination-aware modeling. But to set this up a bit, let's talk about biased recidivism prediction models first. Can you tell us about the track record of some of these recidivism prediction models?
00:29:56
Speaker
So it's apparently quite common in the US, when a suspect arrives before a judge, to predict whether that person will recommit a crime within two years or not. That's really interesting for a judge to know when sentencing that person. And so there are these tools out there that actually make this kind of prediction. But it seems that they might be discriminating against black people.
00:30:24
Speaker
And so there was a study by ProPublica where they said, well, black people are misclassified as likely to reoffend much more often than white people. So black people who actually wouldn't do anything wrong are predicted to reoffend, and hence sentenced to longer prison time, much more often than white people. And so they said, look, what's happening here is really, really bad. These models are discriminating.
00:30:51
Speaker
But the difficult thing of this whole discussion is how do you measure whether something is discriminating? So ProPublica says it should be making mistakes at the same rate. But what the developer of this tool said was, well, actually, if we look at people predicted to recommit a crime,
00:31:11
Speaker
black people and white people recommit a crime at the same rate, both at about 70%. And so you might say, wait, how are these two things different? Well, they're just slightly different measures. But if you measure fairness slightly differently, you get different answers as to whether it's fair or not.
00:31:28
Speaker
Within this field, it has even been mathematically proven that you cannot have both at the same time. So it's really about being transparent: what do you take as your fairness measure, and why that measure?
00:31:42
Speaker
Is it that if I predict you're likely going to reoffend, the prediction is correct at the same rate whether you're black or white; versus, if you're wrongly predicted to recommit a crime, that should happen at the same rate for black and white people? Why should you use one of these two? And that's where the whole discussion is really needed. And it shows how complex this whole field still is: what is the right metric to actually measure fairness? It's not just that we should look at it, but how should we do it?
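That tension can be shown in a few lines. With the made-up counts below (not the real COMPAS figures), the per-group confusion matrices give equal precision, the developer-style measure, but very different false-positive rates, the ProPublica-style measure:

```python
# Invented confusion-matrix counts per group (not the real COMPAS numbers).
# TP: flagged high-risk and reoffended; FP: flagged high-risk, did not reoffend;
# FN: flagged low-risk but reoffended;  TN: flagged low-risk, did not reoffend.
groups = {
    "group_a": {"TP": 70, "FP": 30, "FN": 20, "TN": 80},
    "group_b": {"TP": 35, "FP": 15, "FN": 40, "TN": 110},
}

for name, c in groups.items():
    fpr = c["FP"] / (c["FP"] + c["TN"])   # error rate among those who did not reoffend
    ppv = c["TP"] / (c["TP"] + c["FP"])   # reoffence rate among those flagged high-risk
    print(f"{name}: false-positive rate = {fpr:.2f}, precision = {ppv:.2f}")

# group_a: false-positive rate 0.27, precision 0.70
# group_b: false-positive rate 0.12, precision 0.70
# Equal precision, unequal false-positive rates: the same model looks "fair"
# under one metric and "unfair" under the other, which is the tension the
# speaker describes.
```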
00:32:12
Speaker
And that's where more ethical discussions come in. It's not a technical solution, here's a metric, implement it, and we're fine. No, it's an ethical discussion: should we be using this measure or that measure? And so that really requires people with domain expertise, perhaps with some ethical background, not just computer scientists making these kinds of decisions.
00:32:34
Speaker
Definitely, this is a call to all domain experts, because there seems to be a tendency on the part of the lay public to be overly credulous toward many of the findings of these models. They'll say, well, the algorithm told me this. I know there's a study where, in mock trials, they present the exact same case to two groups of subjects. The only difference is that one group gets presented some neuroscientific brain imaging data.
00:33:01
Speaker
Well, whatever side has these fMRI scans or whatever tends to be the side that wins. So there's this tendency, it might be innate for all I know, to want to defer to these quote-unquote objective measures. So this is a very important topic.
00:33:18
Speaker
Yeah, absolutely. Recently, I was on the jury of a PhD, and I asked the person, who was working on image data, isn't there a potential for misuse, a very specific question in this area. And they said, oh, but that's actually for the people who use this technology; it's not up to me to look into that. And you see this happening so often: the computer scientist, the data scientist, says, well, it's about how you use that technology.
00:33:42
Speaker
And the person using it thinks, well, I'm sure technically it's ethical. It's fine. And I'll just use it for this thing that I have in mind. And so you see, it's really important that both data scientists but also users of AI are aware of these risks and not just point to each other about this.
00:34:00
Speaker
Agreed, agreed. Okay, so then talk to us a bit about discrimination-aware modeling. Tell us what that is and how it might help out in this kind of situation.
00:34:13
Speaker
Yeah, so there are techniques to actually try to get rid of some of this unfair bias while you're building your model. Most of the time when we build this kind of model, we say, let's make sure it's as accurate as possible. We're making predictions about who to give a loan or not; we have this data set where we know who actually repaid a loan and who did not; let's make sure that the model predictions are in line as much as possible with this actual outcome.
00:34:44
Speaker
Now, what we can do is say: you shouldn't just make sure it's accurate, you should also make sure it's fair, fair by whatever metric you use. For example, you could require that white and black people get credit at the same rate. So instead of just saying it should be really accurate, you say it should also be fair. And there are techniques out there to really enforce fairness rather than just looking at accuracy.
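As a rough sketch of that idea, and not the specific techniques from the book, the example below adds a demographic-parity penalty to a plain logistic-regression loss, so training trades off accuracy against the gap in predicted approval rates between two groups. The data, penalty form, and hyperparameters are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented credit data: two features, a sensitive attribute s, repayment label y.
n = 2000
s = rng.integers(0, 2, n)                        # 0 = group A, 1 = group B
X = rng.normal(size=(n, 2)) + s[:, None] * 0.8   # groups have shifted features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0.5).astype(float)
Xb = np.hstack([X, np.ones((n, 1))])             # add an intercept column

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(lam, epochs=2000, lr=0.1):
    """Logistic regression whose loss adds lam * (demographic-parity gap)^2,
    the gap being the difference in mean predicted score between groups."""
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = sigmoid(Xb @ w)
        grad_loss = Xb.T @ (p - y) / n           # ordinary log-loss gradient
        gap = p[s == 1].mean() - p[s == 0].mean()
        dgap = (Xb[s == 1] * (p[s == 1] * (1 - p[s == 1]))[:, None]).mean(axis=0) \
             - (Xb[s == 0] * (p[s == 0] * (1 - p[s == 0]))[:, None]).mean(axis=0)
        w -= lr * (grad_loss + 2 * lam * gap * dgap)
    p = sigmoid(Xb @ w)
    acc = ((p > 0.5) == y).mean()
    rate_gap = (p[s == 1] > 0.5).mean() - (p[s == 0] > 0.5).mean()
    return acc, rate_gap

for lam in (0.0, 5.0):
    acc, gap = train(lam)
    print(f"lambda={lam}: accuracy={acc:.3f}, approval-rate gap={gap:+.3f}")
# Raising lambda typically shrinks the approval-rate gap at some cost in accuracy.
```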
00:35:11
Speaker
In the book, by the way, I never really go into too much depth; it's not a book intended as deep technical guidance on how exactly to do this. But I found it so interesting myself how much technology is out there on fairness, on privacy.
00:35:29
Speaker
And I try to explain it in almost layman's terms. So I hope that people get an idea of, look, there are solutions for these kinds of things. It's not just, well, hands up in the air, it's unfair, what can we do? And it's also not so extremely complicated that only data scientists can do it. It relates to this feeling I have that people should be aware of what kinds of solutions are out there and how they can be implemented.
00:35:57
Speaker
And this is definitely a discussion that should be had by the wider public because you do bring up that there are different conceptions of what fairness means.

Conceptions and Discussions of Fairness in AI

00:36:06
Speaker
And I'm sure you're familiar with the views of some of my compatriots here in the States, and how there might be some disagreement about what it might mean to make these models fair. Can you talk to us about these different conceptions of fairness? Yeah, so
00:36:27
Speaker
there are many metrics there, and it's like equal outcome versus equal opportunity. So how do you measure this? And that in itself is really, really a hard discussion to have. For example, something that's often used is demographic parity: that, for example, men and women should be given credit at the same rate.
00:36:50
Speaker
If 80% of men get credit, then 80% of women should be getting credit. You might say, well, that seems fair. But then it might be, well, hold on a minute: women's income on average is lower than that of men.
00:37:03
Speaker
So as a bank, do you really want to give as much credit to women as to men, knowing that their income is lower and so they're maybe more likely to default? As a bank, you might say, well, this really has a bad effect on credit risk, and we agree income is really important for credit risk, for whether you're able to repay the loan. So that might not be the right definition. Maybe it should be that all women with a high income and all men with a high income are given credit at the same rate.
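As a small numerical illustration of those two notions (all counts invented), overall approval rates can differ by gender even when the rates within each income band are identical:

```python
import pandas as pd

# Invented approval counts by gender and income band.
data = pd.DataFrame({
    "gender":     ["M", "M", "F", "F"],
    "income":     ["high", "low", "high", "low"],
    "applicants": [800, 200, 300, 700],
    "approved":   [640, 60, 240, 210],
})

# Demographic parity: overall approval rate per gender.
overall = data.groupby("gender")[["approved", "applicants"]].sum()
overall["rate"] = overall["approved"] / overall["applicants"]
print(overall["rate"])          # F: 0.45, M: 0.70 -> looks unfair

# Conditioned on income: identical rates within each band.
data["rate"] = data["approved"] / data["applicants"]
print(data.pivot(index="income", columns="gender", values="rate"))
# high: 0.80 for both, low: 0.30 for both -> looks fair by this measure
```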
00:37:33
Speaker
And so, in that way, there are like dozens of different measures, and you can argue for all of them. That's really where the ethical discussion should be: how are we going to measure this, and what are we going to use? There's this example where we see that black people are admitted to colleges less often than white people
00:37:53
Speaker
by these AI systems. And if you look at what the reason is, well, SAT score comes out as one of the main reasons why black people are more likely to be rejected. And if you then look at the population, you see that the distribution of black people's SAT scores is a bit lower than that of white people,
00:38:11
Speaker
because, social research tells us, of less income for tutoring, less emphasis by the parents on having a high SAT score, and so on. So there are many reasons for it. So then there's an ethical discussion: should we be using SAT score in our AI model or not? These are all really hard questions, which are not just a data science question of should I include this variable, yes or no. It's an open ethical discussion to be had by colleges,
00:38:41
Speaker
being transparent about it: we're not using it because of this reason, or, we feel this is really needed because we want to admit students from all over the country and it's the only way we have to get some general agreement on how well they were performing before.
00:38:56
Speaker
So you see, being ethical in data science is not just a data science problem; mostly it's an ethical discussion, where the people having the discussion should be aware of the issues, these metrics, and the technical issues around them.
00:39:13
Speaker
Fascinating, and that is a good transition. So you just mentioned that SATs are an important marker for these models, but at least we know that SATs are important to the end output of these models, because, as is well known,
00:39:33
Speaker
There are many models where we have no idea how they do what they do.

Transparency and Trust in AI Models

00:39:38
Speaker
They're black boxes, as they say; they are opaque. But ethical data science, you write, calls for understandable and comprehensible models. We don't want a large neural network making decisions without us knowing how it did so. So you use a technical definition of understandable and comprehensible.
00:40:00
Speaker
First, can you talk to us about how we should understand these terms? And then you can discuss why understandable slash comprehensible models are preferable to opaque black boxes.
00:40:15
Speaker
Right. So these AI models, they're basically formulas, very complicated, long formulas with hundreds of variables in there, which are combined in some nonlinear, very complex way. So we understand how these models are learned from the data. But if you would just look at this complex model, it's just a huge formula, very complicated.
00:40:39
Speaker
Well, it's very hard to see how, with my income, my profession, and so on, I now get to a decision: well, through this complex formula. So that's what we mean by comprehensibility: to what extent do you understand this decision-making formula?
00:40:57
Speaker
And typically, the better the model, the more complex the formula becomes. For example, ChatGPT, GPT-4, has over a trillion of these kinds of variables in its formula. And I did an exercise: if you would print that formula, how big would it be?
00:41:15
Speaker
Well, it's not a book, it's a library of millions of books, for just that one formula that does this next-word prediction. So we've come to an age where these models are extremely complex to understand, but they typically work quite well. Even if they work well, though, you want to know: why am I not getting credit? What's happening here in these hundreds of variables that you use about me? What exactly in them leads you to say that I shouldn't be getting credit?
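One common style of explanation for a single decision, shown here as a toy version and not necessarily the specific method from the book, is a counterfactual: the smallest change to the applicant's inputs that would flip the outcome. The "black box" below is a tiny invented scoring rule standing in for a complex model, and all feature names and values are hypothetical.

```python
from itertools import product

# Stand-in for an opaque credit model: True means credit granted.
def black_box(applicant: dict) -> bool:
    score = (0.03 * applicant["income_k"]
             + 0.5 * applicant["years_at_job"]
             - 0.8 * applicant["open_loans"])
    return score >= 1.2

def counterfactual(applicant: dict, grid: dict):
    """Search small feature changes; return the one changing the fewest
    features that flips a rejection into an approval."""
    best = None
    for combo in product(*[[(f, v) for v in vals] for f, vals in grid.items()]):
        candidate = dict(applicant)
        candidate.update(dict(combo))
        if black_box(candidate):
            diff = {f: v for f, v in candidate.items() if v != applicant[f]}
            if diff and (best is None or len(diff) < len(best)):
                best = diff
    return best

applicant = {"income_k": 30, "years_at_job": 1, "open_loans": 2}
print(black_box(applicant))                                  # False: credit denied
grid = {"income_k": [30, 45, 60], "open_loans": [0, 1, 2]}
print(counterfactual(applicant, grid))                       # {'open_loans': 0}
# Readable answer: "with no other open loans, you would have been approved."
```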
00:41:41
Speaker
In the book, I talk about the story of a famous tech entrepreneur who applied for the Apple Card, and his wife actually got a 20 times lower credit limit than he did.
00:41:54
Speaker
And he said, well, but we share all assets, we've been married for a long time, so how on earth is it possible that she gets this very low credit limit? They called Apple Card, who were very friendly, but who said, I swear we're not discriminating, it's just the algorithm. And you see here the issue that even Apple Card,
00:42:12
Speaker
or actually Goldman Sachs, who was doing the credit scoring, faces this issue of having a very good predictive model but, without any explanation behind it, getting this controversy. And so the guy was actually blaming them, telling them they were discriminating against women, and a formal investigation was launched in New York. Later on, it was revealed that nothing discriminatory was happening. So you see, it's not about fairness, it's about explainability.
00:42:38
Speaker
If it would be, well, you've never been a premium customer with us, you never had more than that amount of money in your account, that's why you get a lower credit limit, it would be okay: that makes sense, thank you, I understand, moving on. But without any explanation, it's open to speculation: it must be because she's a woman. And without an explanation, you also don't know; perhaps it is because she's a woman, and then of course you want to know that as well, so that you can improve your model.
00:43:07
Speaker
And so these explanations are really needed these days because these models are becoming so complex and we need it in order to trust them and to improve our models. Right, right. That's a good example of how ethical data science is also good data science.

Challenges in Ethical AI Evaluation

00:43:22
Speaker
Okay, we've collected the data, we labeled it, we built our model. Now we want to know, is our model any good? So now we're at stage four, ethical evaluation of data science models.
00:43:34
Speaker
The problems you discuss in the book aren't limited to data science; it's pretty much just science in general. There are many temptations that academics might face when they have some great idea, or, in the case of a data scientist, a model that they want to promote or sell or publish or whatever. What are some of these temptations that data scientists might face when they're assessing the value of their models?
00:44:01
Speaker
Yeah, so many things can go wrong here. And the first thing I always say is you have to do the data science correctly. So, for example, how do you evaluate your model? Typically, we use what we call a test set to evaluate it: if we would use that model now on this test set, how accurate would the model be?
00:44:23
Speaker
And I give the story about predicting whether the stock market goes up or down. And you say, well, it works really, really well: let's use a test set and predict whether the market will go up or down on it. We actually know whether the market went up or down, because it already happened. And so we see, well, of these 20 days on which I tested my model, on 15 days it correctly predicted whether it went up or down.
00:44:49
Speaker
You say, wow, 15 out of 20, really good model, let's use it. But what we in data science commonly know is that your test set should be representative. So if you took 20 days in a market that's very bullish, where the market goes up all the time, and you have a model that simply always says buy, buy, buy, the market will go up,
00:45:12
Speaker
well, your model will perform very well in that limited period. But if you would really implement that model, it would be totally crazy to just buy every day.
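A toy illustration of that point, on purely synthetic data rather than any real market: a model that always predicts "up" looks impressive on a cherry-picked bullish window, but over a representative period it is barely better than chance.

```python
import random

random.seed(42)

# Synthetic daily market moves: 1 = up, 0 = down, roughly balanced overall.
days = [1 if random.random() < 0.52 else 0 for _ in range(1000)]

def accuracy(predictions, actual):
    return sum(p == a for p, a in zip(predictions, actual)) / len(actual)

def always_up(window):
    return [1] * len(window)

# Cherry-picked test set: the 20-day stretch with the most "up" days.
best_start = max(range(len(days) - 20 + 1), key=lambda i: sum(days[i:i + 20]))
bullish_window = days[best_start:best_start + 20]

print("accuracy on cherry-picked window:", accuracy(always_up(bullish_window), bullish_window))
print("accuracy on the full period:     ", accuracy(always_up(days), days))
# The first number is far higher (often 0.9 or more); the second sits near the
# base rate of "up" days (about 0.52). The impressive result came from an
# unrepresentative test set, not from a good model.
```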
00:45:23
Speaker
And so you might think, well, that's a bit exaggerated, but it's exactly the kind of thing that happens if you just want to show that you have good results: you have a very small test set, or it's basically cherry-picked so that you seem to have good results. And we all want good results. We as academics, if we don't show good results, we don't get published. If you're in a company and you built these new data science models,
00:45:48
Speaker
And basically, if they don't work well, what are you doing there? And so if it's not better than what you already have, you also want good results to show, look, my model really works. We should put this AI in place because it works great. And so there's clear incentives to have good results.
00:46:04
Speaker
But if you don't do it correctly, very quickly you get into these situations where you think it's good, but actually it's not good at all. And then you implement it, and there can be disastrous implications, definitely in the financial or the medical world, a real impact on people, if you don't do it correctly. So that's basically the first thing: you really should do it correctly. And the other thing is you really should be looking at all these criteria of fairness, accountability, and transparency.
00:46:32
Speaker
And in the book, I also talk about some additional requirements like sustainability. So what's the ecological footprint of your AI model, for example, that you built? Right. There's a wonderful book called The Atlas of AI, which discusses how, even though we like to think that AI is kind of ethereal in some way, it actually requires massive amounts of resources and land, and it produces all kinds of CO2 emissions and all that.
00:47:01
Speaker
However, I'm becoming aware of the time here. So maybe we should move on to the final stage, which is in many ways, maybe the most interesting, ethical deployment of AI models.

Ethical Deployment and Control of AI Models

00:47:15
Speaker
So the data has been collected and pre-processed, the model has been trained, and it's ready to launch. I'll leave this fairly open-ended for you.
00:47:28
Speaker
Shouldn't we just give everyone access to these models? They're ready to go, right? Yeah, well, absolutely not. I think one interesting example is face recognition software. And I think a famous company is Clearview AI.
00:47:44
Speaker
that has basically crawled all these images of all of us from social media, anywhere on the web, and because of all that data it's really good at predicting, from any picture, who's in the image and what the links are to all their social media. And then there's a big discussion about who should get access to this kind of model. Should it be just law enforcement? Should it also be big companies who want to use it
00:48:10
Speaker
for security reasons? Should it be just anybody? If you say it's just law enforcement, then is it only law enforcement of the US or of Western countries? Where do you say a country is no longer democratic? For example, if you say it has to be a democratic country, because we don't allow
00:48:30
Speaker
it to be misused in dictatorships to go after citizens, then where do you put the cutoff of when a country is still democratic? And these are very hard questions, again social questions more than data science questions,
00:48:46
Speaker
which very often are just thought about after the fact, if at all. With Clearview AI, for example, there was an issue that investors were allowed to download the app. You have the story of one of these potential investors going to a restaurant, seeing his daughter on a date, and he
00:49:05
Speaker
discreetly took a picture of the date and immediately saw everything about who that guy was, where he was working, and everything. And so you quickly see how creepy it is that these kinds of tools are out there. For some use cases in law enforcement, you might be inclined to say face recognition should be used; it's also a big discussion here in Europe around the AI Act, when it is fine and when it is not. But there should clearly be limitations on who gets access to it and when.
00:49:35
Speaker
The same with all these ChatGPT-style tools that we have right now. In Snapchat, for example, you have My AI. It's a cool kind of thing where you can chat with an AI, very similar to ChatGPT. But at least here in Belgium, in Europe, if you want to get rid of this My AI, you have to get the paid version; you cannot remove it otherwise. And so you have all these children talking to a chatbot.
00:50:01
Speaker
And we never did any experiments: what is the effect on these children? Do they talk to real friends less? Do they have more friendships, fewer friendships? What do they tell these chatbots? We went through this whole social media experience where we saw, after the fact, a year later, what the detrimental effect was, mainly on teenage girls. And now we're doing the same thing with these tools, where in deployment we say: who has access? Everybody, it's a nice tool.
00:50:28
Speaker
And we should have an ethical discussion about whether we should be doing this, definitely with minors. I think it's just a discussion
00:50:37
Speaker
that currently is not being had, and it's one of these ethical discussions that we should all be having: should we have this tool available? And definitely when you can only get rid of it if you pay for it. I think parents, for example, should definitely be aware of this kind of thing. So again, it's not just data scientists or managers who should be aware of these issues; I think it's a broad issue for citizens in general to be aware of and talk about.
00:51:01
Speaker
That is one of the issues that is literally keeping me up at night. This prospect that we might find more value in relationships with the AI agents than with other human beings and what that might do to the cohesiveness of society. Okay,

Centralized Control of AI: OpenAI Example

00:51:16
Speaker
so as we move to wrap up,
00:51:19
Speaker
We could have a whole podcast on the question I'm about to ask you. These tools that we're creating are incredibly powerful, akin to sorcery, really. Given that this is a tremendous amount of power to have, how should we think about who should control these models? Do you have any thoughts on that? And maybe you can use the OpenAI debacle to discuss this.
00:51:44
Speaker
Right, so I think definitely with these large language models like OpenAI has, there's a huge concentration of power with a few players. You have OpenAI, you have Facebook, you have Google, but it's so expensive to build these models that just a few of them
00:52:03
Speaker
can build them. And you saw with OpenAI the discussion; well, we still don't know exactly what the whole firing of Sam Altman was about, but it seems to have been about to what extent we should increase the commercialization of this kind of technology or not. And you see this one person, Sam Altman, deciding for all of us to what extent this technology is going to move forward towards AGI,
00:52:30
Speaker
or whether it's being slowed down. And you saw that this board had some doubts about it. But even if that board says we shouldn't be doing this, are we okay that this handful of people, six people, decide for all of humanity whether we should move forward in this way or not? So that's why regulation is needed, at least regulation or a broader discussion of
00:52:52
Speaker
who has access to this data, how they should be treating it, what they should be building on top of it, and who gets access to it. There's so little discussion about this concentration risk with just a few people.
00:53:06
Speaker
And we all kind of idolize these people: wow, Sam Altman, Elon Musk, they're these great entrepreneurs, they know the technology, it must be right. And so this book was written pre-ChatGPT, and it just shows that even if they're not unethical, it can go wrong really, really quickly.
00:53:26
Speaker
And these tools have such a big impact on all of us. These OpenAI innovations are coming to Microsoft Office; they're coming to all of us, whether you're using ChatGPT or not. And so it's important to think about what these tools look like: are they ethical? For example, I asked for an image for the Christmas party I was giving for my family.
00:53:50
Speaker
And I said, make it a bit funny, we're with 20 people. And it put all kinds of people in there. I said, well, we're only white people, and it declined to make that image. And so you might say, well, diversity-wise, we should only make images that are diverse. But
00:54:08
Speaker
this company is deciding, I will not do that for that use case. In this case, you might say the impact is small, you just don't have a representative image, but somebody is making this decision for us, and we don't have a say in whether that's right or not.
00:54:25
Speaker
And now it's a decision in a positive direction, we should have diversity, but it might go totally in another direction. We have no say at all in what these models should be doing, nor is there any transparent process behind it. There's no set of principles saying this is what we follow, or this is why we implement this. And so I feel, again, the solution is not just technology; it's an open, transparent discussion throughout society.
00:55:04
Speaker
Thanks, everyone, for tuning in to the AI and Technology Ethics Podcast. If you found the content interesting or important, please share it with your social networks; it would help us out a lot. The music you're listening to is by The Missing Shade of Blue, which is basically just me. We'll be back next month with a fresh new episode. Until then, be safe, my friends.