Lacramioara Mazilu, N. Paton, Nikolaos Konstantinou, A. Fernandes
Title: Fairness in Data Wrangling
DOI: 10.1109/IRI49571.2020.00056
Published in: 2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science (IRI 2020), virtual conference, 11-13 August 2020
Publication date: 2020-08-01
Citations: 6
Abstract
At the core of many data analysis processes lies the challenge of properly gathering and transforming data. This problem is known as data wrangling, and it becomes even more challenging when the data sources to be transformed are heterogeneous and autonomous, i.e., have different origins, and when the output is meant to be used as a training dataset, which makes it paramount for the dataset to be fair. Given the rising use of artificial intelligence (AI) systems across a variety of domains, fairness issues must be taken into account while building these systems. In this paper, we aim to bridge the gap between gathering the data and making the datasets fair by proposing a method for performing data wrangling while considering fairness. To this end, our method comprises a data wrangling pipeline whose behaviour can be adjusted through a set of parameters. Based on fairness metrics computed over the output datasets, the system plans a set of data wrangling interventions with the aim of lowering the bias in the output dataset. The system uses Tabu Search to explore the space of candidate interventions. We consider two potential sources of dataset bias: those arising from unequal representation of sensitive groups and those arising from hidden biases through proxies for sensitive attributes. The approach is evaluated empirically.
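The abstract does not specify the paper's pipeline parameters or intervention space, but the overall search loop it describes (score the output dataset with a fairness metric, then use Tabu Search over candidate interventions to reduce bias) can be illustrated with a minimal sketch. Here, purely as an assumption for illustration, an "intervention" is simplified to dropping a single record, and bias is measured as the deviation of a binary sensitive group's share from parity; the function and field names (`bias`, `tabu_search`, `sensitive`) are hypothetical, not from the paper.

```python
def bias(dataset):
    # Representation bias: how far the sensitive-group share is from 0.5.
    if not dataset:
        return 1.0
    share = sum(1 for r in dataset if r["sensitive"] == 1) / len(dataset)
    return abs(share - 0.5)

def fingerprint(dataset):
    # Cheap hashable key identifying a candidate dataset for the tabu list.
    return tuple(sorted(r["id"] for r in dataset))

def neighbours(dataset):
    # Candidate interventions: drop one record. The paper's interventions
    # act on the wrangling pipeline itself, which this sketch does not model.
    return [dataset[:i] + dataset[i + 1:] for i in range(len(dataset))]

def tabu_search(dataset, iterations=20, tabu_size=5):
    best = current = dataset
    tabu = []  # fingerprints of recently visited candidates
    for _ in range(iterations):
        candidates = [n for n in neighbours(current)
                      if fingerprint(n) not in tabu]
        if not candidates:
            break
        # Move to the least-biased non-tabu neighbour, even if it is
        # worse than the current state (this is what lets Tabu Search
        # escape local optima).
        current = min(candidates, key=bias)
        tabu.append(fingerprint(current))
        tabu = tabu[-tabu_size:]
        if bias(current) < bias(best):
            best = current
    return best

# Toy dataset: 8 records in the sensitive group, 2 outside it.
data = [{"id": i, "sensitive": 1 if i < 8 else 0} for i in range(10)]
result = tabu_search(data)
```

On this toy input the search trims majority-group records until both groups are equally represented, at which point `bias(result)` reaches 0. A real instantiation would replace the drop-one-record neighbourhood with the paper's pipeline-level interventions and the parity metric with whichever fairness metrics the system computes on the output dataset.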