2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science : IRI 2020 : proceedings : virtual conference, 11-13 August 2020. IEEE International Conference on Information Reuse and Integration (21st : 2... Latest Publications

Modeling of Clinical Mammography Recognition

Pub Date : 2020-08-01 DOI: 10.1109/IRI49571.2020.00068

Kuo-Chung Chu, Po-Yao Tsai, Tien-Yu Chang, Yu-Shu Wu

Breast cancer screening enables early detection and treatment, and mammography is one of the most popular screening methods. Recognition in mammography images depends on the radiologist, but human interpretation of these images has its limitations. Recently, in support of precision medicine, deep learning has been applied to medical images to reduce the risk of misinterpreting breast lesion types (BI-RADS, the Breast Imaging Reporting and Data System, divided into categories 0 to 6). This study proposes a mammography recognition model based on deep learning to support the clinical diagnosis of breast cancer. The model aims to improve the quality of medical care.

Citations: 0

Relating the Empirical Foundations of Attack Generation and Vulnerability Discovery

Pub Date : 2020-08-01 DOI: 10.1109/IRI49571.2020.00014

Tyler Westland, Nan Niu, R. Jha, David Kapp, T. Kebede

Automatically generating exploits for attacks receives much attention in security testing and auditing. However, little is known about the continuous effect of automatic attack generation and detection. In this paper, we develop an analytic model to understand the cost-benefit tradeoffs in light of the process of vulnerability discovery. We develop a three-phased model, suggesting that the cumulative malware detection has a productive period before the rate of gain flattens. As the detection mechanisms co-evolve, the gain will likely increase. We evaluate our analytic model by using an anti-virus tool to detect the thousands of Trojans automatically created. The anti-virus scanning results over five months show the validity of the model and point out future research directions.

Citations: 1

RNN-VED for Reducing False Positive Alerts in Host-based Anomaly Detection Systems

Pub Date : 2020-08-01 DOI: 10.1109/IRI49571.2020.00011

Lydia Bouzar-Benlabiod, S. Rubin, Kahina Belaidi, Nour ElHouda Haddar

Host-based Intrusion Detection Systems (HIDS) are often based on anomaly detection. Several studies address anomaly detection by analyzing system-call traces and achieve good detection rates, but also a high rate of false positives. In this paper, we propose a new anomaly detection approach applied to system-call traces. Normal behavior is learned using a sequence-to-sequence model based on a Variational Encoder-Decoder (VED) architecture that integrates Recurrent Neural Network (RNN) cells. We exploit the semantics behind the invocation order of system calls, which are then treated as sentences. A preprocessing phase is added to structure and optimize the representation of the model's input data. After the learning step, a one-class classification is run to categorize the sequences as normal or abnormal. The architecture may be used for predicting abnormal behaviors. The tests are run on the ADFA-LD dataset.
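
The one-class idea in this abstract (learn only from normal system-call traces, then flag sequences that deviate) can be illustrated with a much simpler stand-in. The sketch below is not the RNN-VED model: it swaps in an n-gram lookup, and the class name, the threshold value, and the toy traces are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def ngrams(trace: Sequence[int], n: int) -> List[Tuple[int, ...]]:
    """Slide a window of length n over a system-call trace."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]


class NgramOneClassDetector:
    """Learns n-grams from normal traces only (one-class setting)."""

    def __init__(self, n: int = 3, threshold: float = 0.2) -> None:
        self.n = n
        self.threshold = threshold  # hypothetical cut-off, to be tuned on normal data
        self.known: Counter = Counter()

    def fit(self, normal_traces: Iterable[Sequence[int]]) -> "NgramOneClassDetector":
        for trace in normal_traces:
            self.known.update(ngrams(trace, self.n))
        return self

    def score(self, trace: Sequence[int]) -> float:
        """Fraction of the trace's n-grams never seen during training."""
        grams = ngrams(trace, self.n)
        if not grams:
            return 0.0
        unseen = sum(1 for g in grams if g not in self.known)
        return unseen / len(grams)

    def is_abnormal(self, trace: Sequence[int]) -> bool:
        return self.score(trace) > self.threshold


# Toy traces; the integers stand in for system-call numbers.
normal = [[3, 4, 5, 3, 4, 5, 6], [3, 4, 5, 6, 3, 4, 5]]
detector = NgramOneClassDetector(n=3, threshold=0.2).fit(normal)
```

Calling `detector.is_abnormal(trace)` on a new trace returns `True` when too many of its windows were never observed in normal behavior. A learned sequence model would replace the lookup with a reconstruction or likelihood score, but the thresholding step would look similar.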

Pages : 17-24

Citations: 6

Multimodal Information Integration for Indoor Navigation Using a Smartphone

Pub Date : 2020-08-01 DOI: 10.1109/IRI49571.2020.00017

Yaohua Chang, Jin Chen, Tyler Franklin, Lei Zhang, Arber Ruci, Hao Tang, Zhigang Zhu

We propose an accessible indoor navigation application. The solution integrates information from floor plans, Bluetooth beacons, Wi-Fi/cellular data connectivity, 2D/3D visual models, and user preferences. Hybrid models of interiors are created in a modeling stage with Wi-Fi/cellular data connectivity, beacon signal strength, and a 3D spatial model. This data is collected as the modeler walks through the building and is mapped to the floor plan. A client-server architecture allows scaling to large areas by lazy-loading models according to beacon signals and/or adjacent region proximity. During the navigation stage, a user with the designed mobile app is localized within the floor plan, using visual, connectivity, and user preference data, and guided along an optimal route to their destination. User interfaces for both modeling and navigation use visual, audio, and haptic feedback for targeted users. While the current pandemic precludes our user study, we describe its design and preliminary results.

Pages : 59-66

Citations: 6

Wrapping a NoSQL Datastore for Stream Analytics

Pub Date : 2020-08-01 DOI: 10.1109/IRI49571.2020.00050

Khalid Mahmood, Kjell Orsborn, T. Risch

With the advent of the Industrial Internet of Things (IIoT) and Industrial Analytics, numerous application scenarios emerge in which business and mission-critical decisions depend upon large-scale analytics of sensor streams. However, very large volumes of data from streams generated at a high rate pose substantial challenges for providing scalable analytics with existing Database Management Systems (DBMS). While scalability can be provided by high-performance distributed datastores, their query operations are simple, so access to high-level query-based data analytics is usually limited. This work combines high-level query-based data analytics capabilities with high-performance distributed scalability by applying a wrapper-mediator approach. The Amos II extensible main-memory DBMS provides an online query-processing and data analytics engine in front of the MongoDB distributed NoSQL datastore to support large-scale distributed data analytics over persisted data streams. Thus, the implemented system enables query-based online data stream analytics over persisted data streams stored/logged in distributed NoSQL datastores.
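
The wrapper-mediator split described above can be sketched in a few lines. This is only an illustration under stated assumptions: `SimpleStore` is an in-memory stand-in for a datastore with simple query operations (it is neither MongoDB nor Amos II), and `StoreWrapper` and `windowed_mean` are hypothetical names for the wrapper and for one mediator-side analytic.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, Iterator, List, Tuple


class SimpleStore:
    """Stand-in for a NoSQL datastore that offers only insert and filtered scan."""

    def __init__(self) -> None:
        self._docs: List[Dict[str, Any]] = []

    def insert(self, doc: Dict[str, Any]) -> None:
        self._docs.append(doc)

    def find(self, sensor: str, t_from: int, t_to: int) -> Iterator[Dict[str, Any]]:
        for doc in self._docs:
            if doc["sensor"] == sensor and t_from <= doc["t"] < t_to:
                yield doc


class StoreWrapper:
    """Wrapper: turns store documents into (time, value) tuples for the mediator."""

    def __init__(self, store: SimpleStore) -> None:
        self.store = store

    def readings(self, sensor: str, t_from: int, t_to: int) -> Iterator[Tuple[int, float]]:
        for doc in self.store.find(sensor, t_from, t_to):
            yield doc["t"], doc["value"]


def windowed_mean(wrapper: StoreWrapper, sensor: str, t_from: int, t_to: int,
                  width: int) -> Dict[int, float]:
    """Mediator-side analytics: mean value per tumbling window of `width` time units."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for t, value in wrapper.readings(sensor, t_from, t_to):
        buckets[(t - t_from) // width].append(value)
    return {window: mean(values) for window, values in sorted(buckets.items())}


# Log six toy readings for one sensor, then query them through the wrapper.
store = SimpleStore()
for t, v in enumerate([1.0, 3.0, 5.0, 7.0, 9.0, 11.0]):
    store.insert({"sensor": "s1", "t": t, "value": v})
result = windowed_mean(StoreWrapper(store), "s1", 0, 6, width=2)
```

The point of the design is that the datastore only ever sees a simple filtered scan, while the higher-level analytic lives entirely on the mediator side and could be swapped without touching the store.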

Pages : 301-305

Citations: 2

Quality not Quantity! A Qualitative Evaluation and Proposal for Understanding the Depth of Audience “Knowledge” Post Data Extraction

Pub Date : 2020-08-01 DOI: 10.1109/IRI49571.2020.00031

Kimberley Hemmings-Jarrett, Terryann Barnett, Julian Jarrett, M. Blake, Denise E. Agosto

Knowledge is defined as…the result of machine extracted patterns; humans making sense of their environment; information generated and aggregated from software services or as the lowest form of human cognition. Different perspectives, different domains, but one concept. Information scientists are often concerned with retrieving knowledge from data sources and sharing that knowledge with concerned stakeholders; with such differing views on what qualifies as knowledge a cross-domain approach might prove beneficial. This work is a qualitative assessment of the layers of knowledge intended to bridge the gap between the analyst and their intended or unintended audiences. It examines the benefit of abstracting concepts used in the education discipline to justify including a post-evaluation stage to the Knowledge Discovered through Databases (KDD) framework. It also intends to promote awareness of the various human cognitive capacities and provide a useful approach for communicating and evaluating machine-extracted knowledge that supports higher order thinking.

Pages : 164-171

Citations: 0

Semantic Data Understanding with Character Level Learning

Pub Date : 2020-08-01 DOI: 10.1109/IRI49571.2020.00043

Michael J. Mior, K. Pu

Databases are growing in size and complexity. With the emergence of data lakes, databases have become open, fast-evolving, and highly heterogeneous. Understanding the complex relationships among different entity types in such scenarios is both challenging and necessary for data scientists. We propose an approach that utilizes a convolutional neural network to learn patterns associated with each entity type in the database at the character level. We demonstrate that the learned character-level patterns can capture sufficient semantic information for many useful applications, including data lake schema exploration and interactive data cleaning.
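
As a rough illustration of what "character level" input means here, the sketch below turns a cell value into a fixed-length one-hot matrix of the kind a 1-D convolutional network typically consumes. It covers only the encoding step, not the network, and the alphabet, the sequence length, and the function names are assumptions rather than details from the paper.

```python
import string
from typing import List

# Hypothetical alphabet; index 0 is reserved for padding and unknown characters.
ALPHABET = string.ascii_lowercase + string.digits + " -_.@/:"
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def encode(value: str, length: int = 16) -> List[int]:
    """Map a cell value to a fixed-length sequence of character indices."""
    indices = [CHAR_TO_INDEX.get(ch, 0) for ch in value.lower()[:length]]
    return indices + [0] * (length - len(indices))


def one_hot(indices: List[int]) -> List[List[int]]:
    """Expand indices into one-hot rows, one row per character position."""
    width = len(ALPHABET) + 1
    return [[1 if i == idx else 0 for i in range(width)] for idx in indices]


# A date-like value: its digit/hyphen layout is visible at the character level.
encoded = encode("2020-08-01")
matrix = one_hot(encoded)
```

Values of the same entity type (dates, e-mail addresses, identifiers) tend to share such character layouts, which is the kind of regularity a convolutional filter could pick up.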

Citations: 0

IRI 2020 Commentary

Pub Date : 2020-08-01 DOI: 10.1109/iri49571.2020.00001

Citations: 0

IRI 2020 Index

Pub Date : 2020-08-01 DOI: 10.1109/iri49571.2020.00077

Citations: 0

Fairness in Data Wrangling

Pub Date : 2020-08-01 DOI: 10.1109/IRI49571.2020.00056

Lacramioara Mazilu, N. Paton, Nikolaos Konstantinou, A. Fernandes

At the core of many data analysis processes lies the challenge of properly gathering and transforming data. This problem is known as data wrangling, and it can become even more challenging if the data sources that need to be transformed are heterogeneous and autonomous, i.e., have different origins, and if the output is meant to be used as a training dataset, thus, making it paramount for the dataset to be fair. Given the rise in usage of artificial intelligence (AI) systems for a variety of domains, it is necessary to take into account fairness issues while building these systems. In this paper, we aim to bridge the gap between gathering the data and making the datasets fair by proposing a method for performing data wrangling while considering fairness. To this end, our method comprises a data wrangling pipeline whose behaviour can be adjusted through a set of parameters. Based on the fairness metrics run on the output datasets, the system plans a set of data wrangling interventions with the aim of lowering the bias in the output dataset. The system uses Tabu Search to explore the space of candidate interventions. In this paper we consider two potential sources of dataset bias: those arising from unequal representation of sensitive groups and those arising from hidden biases through proxies for sensitive attributes. The approach is evaluated empirically.
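
To make the search step concrete, here is a minimal Tabu Search sketch. It is not the authors' system: interventions are modelled as on/off switches, `toy_bias` and its shift values are hypothetical stand-ins for a fairness metric computed on the wrangled output, and the neighbourhood is limited to flipping one switch at a time.

```python
from collections import deque
from typing import Callable, Tuple

State = Tuple[int, ...]  # one 0/1 switch per candidate intervention


def tabu_search(
    start: State,
    bias: Callable[[State], float],
    iterations: int = 50,
    tabu_size: int = 3,
) -> Tuple[State, float]:
    """Return the lowest-bias state found and its bias value."""
    current = start
    best, best_cost = current, bias(current)
    tabu: deque = deque(maxlen=tabu_size)  # recently flipped positions

    for _ in range(iterations):
        candidates = []
        for i in range(len(current)):
            neighbour = current[:i] + (1 - current[i],) + current[i + 1:]
            cost = bias(neighbour)
            # Aspiration: a tabu move is allowed if it beats the best so far.
            if i not in tabu or cost < best_cost:
                candidates.append((cost, i, neighbour))
        if not candidates:
            break
        # Move to the best admissible neighbour, even if it is worse than current.
        cost, i, current = min(candidates)
        tabu.append(i)
        if cost < best_cost:
            best, best_cost = current, cost
    return best, best_cost


def toy_bias(state: State) -> float:
    """Hypothetical metric: distance of group A's share from 50 percentage points."""
    # Each switched-on intervention shifts group A's share by a made-up amount.
    shifts = (10, -5, 20, 5)
    share_a = 30 + sum(s for s, on in zip(shifts, state) if on)
    return abs(share_a - 50)


best_state, best_bias = tabu_search((0, 0, 0, 0), toy_bias)
```

The tabu list is what distinguishes this from plain hill climbing: by forbidding recently flipped switches, the search can leave a local optimum instead of undoing its last move. In a real pipeline, each call to `bias` would presumably involve rerunning the wrangling step and the fairness metrics, so the number of evaluations matters.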

Pages : 341-348

Citations: 6