Clustering Heterogeneous Data Values for Data Quality Analysis

ACM Journal of Data and Information Quality (IF 2.9; JCR Q3, Computer Science, Information Systems). Pub Date: 2023-06-22. DOI: 10.1145/3603710

Viola Wenz, Arno Kesper, G. Taentzer

Pages: 1-33

Citations: 0

Abstract

Data is of high quality if it is fit for its intended purpose. Data heterogeneity can be a major quality problem, as quality aspects such as understandability and consistency can be compromised. Heterogeneity of data values is particularly common when data is manually entered by different people using inadequate control rules. In this case, syntactic and semantic heterogeneity often go hand in hand. Heterogeneity of data values may be a direct result of problems in the acquisition process, quality problems of the underlying data model, or possibly erroneous data transformations. For example, in the cultural heritage domain, it is common to analyze data fields by manually searching lists of data values sorted alphabetically or by number of occurrences. Additionally, search functions such as regular expression matching are used to detect specific patterns. However, this requires a priori knowledge and technical skills that domain experts often do not have. Since such datasets often contain thousands of values, the entire process is very time-consuming. Outliers or subtle differences between values that may be critical to data quality can be easily overlooked. To improve this process of analyzing the quality of data values, we propose a bottom-up human-in-the-loop approach that clusters values of a data field according to syntactic similarity. The clustering is intended to help domain experts explore the heterogeneity of values in a data field and can be configured by domain experts according to their domain knowledge. The overview of the syntactic diversity of the data values gives an impression of the rules and practices of data acquisition as well as their violations. From this, experts can infer potential quality issues with the data acquisition process and system, as well as the data model and data transformations. We outline a proof-of-concept implementation of the approach. 
Our evaluation found that clustering adds value to data quality analysis, especially for detecting quality problems in data models.
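
To make the idea of grouping values by syntactic similarity concrete, here is a minimal sketch. It is not the authors' algorithm, which the abstract does not detail; it assumes a simple stand-in technique in which each value is abstracted to a coarse character-class pattern and values sharing a pattern form one cluster. The function names and the sample date values are hypothetical.

```python
from collections import defaultdict


def syntactic_pattern(value: str) -> str:
    """Abstract a value to a coarse pattern (an assumption, not the paper's method).

    Letters become 'A', digits become '9', whitespace becomes ' ', and any
    other character is kept as-is. Consecutive identical symbols are collapsed,
    so "1850" and "23" both map to "9".
    """
    pattern = []
    for ch in value.strip():
        if ch.isdigit():
            symbol = "9"
        elif ch.isalpha():
            symbol = "A"
        elif ch.isspace():
            symbol = " "
        else:
            symbol = ch
        # Collapse runs of the same symbol into a single occurrence.
        if not pattern or pattern[-1] != symbol:
            pattern.append(symbol)
    return "".join(pattern)


def cluster_by_pattern(values):
    """Group values that share the same syntactic pattern."""
    clusters = defaultdict(list)
    for value in values:
        clusters[syntactic_pattern(value)].append(value)
    return dict(clusters)


# Hypothetical values of a manually entered "date" field.
values = ["1850", "ca. 1850", "1850-1860", "um 1900", "1923", "unknown", "1900-1910"]
clusters = cluster_by_pattern(values)

# List the smallest clusters first: rare patterns are candidates for outliers.
for pattern, members in sorted(clusters.items(), key=lambda kv: len(kv[1])):
    print(f"{pattern!r}: {members}")
```

In such a sketch, a domain expert would review the small clusters first, since rarely used patterns may point to violations of data entry conventions. A configurable approach like the one proposed would let the expert adjust how coarse the abstraction is.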


