Active Learning for Data Quality Control: A Survey

Na Li, Yiyang Qi, Chaoran Li, Zhiming Zhao
Journal of Data and Information Quality, published 2024-05-11
DOI: 10.1145/3663369 (https://doi.org/10.1145/3663369)

Abstract

Data quality plays a vital role in scientific research and decision-making across industries. It is therefore crucial to incorporate a data quality control (DQC) process, which comprises various actions and operations to detect and correct data errors. The increasing adoption of machine learning (ML) techniques in different domains has raised concerns about data quality in the ML field. At the same time, ML’s capability to uncover complex patterns makes it well suited to the challenges of the DQC process. However, supervised learning methods demand abundant labeled data, while unsupervised learning methods rely heavily on the underlying distribution of the data. Active learning (AL) provides a promising solution by proactively selecting data points for inspection, thus reducing the labeling burden on domain experts. This survey therefore focuses on applying AL to DQC. We start with a review of common data quality issues and solutions in the ML field, aiming to enhance the understanding of current quality assessment methods. We then present two scenarios, pool-based and stream-based, that illustrate how AL can be adopted in DQC systems for the anomaly detection task. Finally, we discuss the remaining challenges and research opportunities in this field.
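The difference between the pool-based and stream-based scenarios named in the abstract can be sketched in a few lines. This is an illustrative sketch only, not the survey's method: the function names, the uncertainty measure (distance of an anomaly score from 0.5), the threshold, and the labeling budget are all assumptions chosen for the example.

```python
from typing import Iterable, List, Sequence


def uncertainty(score: float) -> float:
    """Uncertainty of an anomaly score in [0, 1]; highest at 0.5 (assumed measure)."""
    return 1.0 - abs(2.0 * score - 1.0)


def pool_based_query(scores: Sequence[float], k: int) -> List[int]:
    """Pool-based: see the whole pool, then pick the k most uncertain points."""
    ranked = sorted(range(len(scores)),
                    key=lambda i: uncertainty(scores[i]),
                    reverse=True)
    return ranked[:k]


def stream_based_query(scores: Iterable[float], threshold: float,
                       budget: int) -> List[int]:
    """Stream-based: decide for each arriving point, stop when the budget is spent."""
    queried: List[int] = []
    for i, score in enumerate(scores):
        if len(queried) >= budget:
            break
        if uncertainty(score) >= threshold:
            queried.append(i)  # ask the domain expert to inspect this point
    return queried


# Hypothetical anomaly scores produced by some detector.
scores = [0.05, 0.30, 0.48, 0.90, 0.55, 0.97]
pool_picks = pool_based_query(scores, k=2)
stream_picks = stream_based_query(scores, threshold=0.5, budget=2)
print(pool_picks, stream_picks)
```

With the same scores and the same budget of two expert queries, the pool-based selector can rank every point before choosing, while the stream-based selector must commit as points arrive and may spend its budget on earlier, less informative points.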