Clustering Heterogeneous Data Values for Data Quality Analysis

IF 1.5 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS ACM Journal of Data and Information Quality Pub Date : 2023-06-22 DOI:10.1145/3603710
Viola Wenz, Arno Kesper, G. Taentzer
Publication type: Journal Article. Pages: 1-33. URL: https://doi.org/10.1145/3603710
Citations: 0

Abstract

Data is of high quality if it is fit for its intended purpose. Data heterogeneity can be a major quality problem, as quality aspects such as understandability and consistency can be compromised. Heterogeneity of data values is particularly common when data is manually entered by different people using inadequate control rules. In this case, syntactic and semantic heterogeneity often go hand in hand. Heterogeneity of data values may be a direct result of problems in the acquisition process, quality problems of the underlying data model, or possibly erroneous data transformations. For example, in the cultural heritage domain, it is common to analyze data fields by manually searching lists of data values sorted alphabetically or by number of occurrences. Additionally, search functions such as regular expression matching are used to detect specific patterns. However, this requires a priori knowledge and technical skills that domain experts often do not have. Since such datasets often contain thousands of values, the entire process is very time-consuming. Outliers or subtle differences between values that may be critical to data quality can be easily overlooked. To improve this process of analyzing the quality of data values, we propose a bottom-up human-in-the-loop approach that clusters values of a data field according to syntactic similarity. The clustering is intended to help domain experts explore the heterogeneity of values in a data field and can be configured by domain experts according to their domain knowledge. The overview of the syntactic diversity of the data values gives an impression of the rules and practices of data acquisition as well as their violations. From this, experts can infer potential quality issues with the data acquisition process and system, as well as the data model and data transformations. We outline a proof-of-concept implementation of the approach. 
Our evaluation found that clustering adds value to data quality analysis, especially for detecting quality problems in data models.
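
To make the idea of clustering values by syntactic similarity more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a very simple notion of similarity: each value is abstracted to a coarse "shape" (runs of letters become `A`, runs of digits become `9`, other characters are kept), and values with the same shape fall into one cluster. The function names and the sample date values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def syntactic_shape(value: str) -> str:
    """Abstract a value into a coarse pattern (an illustrative assumption).

    Letters map to 'A', digits to '9', whitespace to ' ', and any other
    character is kept as-is. Consecutive identical symbols are collapsed.
    """
    shape = []
    for ch in value.strip():
        if ch.isdigit():
            symbol = "9"
        elif ch.isalpha():
            symbol = "A"
        elif ch.isspace():
            symbol = " "
        else:
            symbol = ch
        # Collapse runs so "1850" and "18500" share the shape "9".
        if not shape or shape[-1] != symbol:
            shape.append(symbol)
    return "".join(shape)


def cluster_by_shape(values):
    """Group values by shape; largest clusters first, ties ordered by shape."""
    clusters = defaultdict(list)
    for v in values:
        clusters[syntactic_shape(v)].append(v)
    return sorted(clusters.items(), key=lambda kv: (-len(kv[1]), kv[0]))


# Hypothetical values of a manually entered date field.
values = ["1850", "1851", "ca. 1850", "1850-1860", "um 1900", "1900?", "18th century"]

for shape, members in cluster_by_shape(values):
    print(f"{shape!r}: {members}")
```

Listing the clusters by size puts the dominant entry convention first and leaves rare shapes, which may point to rule violations or outliers, at the end. A domain expert could then adjust the abstraction, for example by keeping certain keywords unabstracted, which loosely mirrors the configurability the abstract describes.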
Source journal
ACM Journal of Data and Information Quality (Computer Science, Information Systems)
CiteScore: 4.10
Self-citation rate: 4.80%
Articles published: 0