Context-aware Big Data Quality Assessment: A Scoping Review

Journal: ACM Journal of Data and Information Quality
Impact factor: 1.5 (JCR Q3, Computer Science, Information Systems)
Pub Date: 2023-06-13
DOI: 10.1145/3603707

Hadi Fadlallah, R. Kilany, Houssein Dhayne, Rami El Haddad, R. Haque, Y. Taher, Ali H. Jaber


Citations: 2

Abstract

The term data quality refers to measuring the fitness of data for its intended use. Poor data quality leads to inadequate, inconsistent, and erroneous decisions that can escalate computational cost, reduce profits, and cause customer churn. Thus, data quality is crucial for researchers and industry practitioners. Different factors drive the assessment of data quality. Data context is deemed one of the key factors because of the contextual diversity of real-world use cases involving entities such as people and organizations. Data used in a specific context (e.g., an organization policy) may not be equally efficacious in another context. Hence, implementing a data quality assessment solution across different contexts is challenging. Traditional technologies for data quality assessment have reached maturity, and existing solutions can solve most quality issues. The data context in these solutions is defined as validation rules applied within the ETL (extract, transform, load) process, i.e., the data warehousing process. In contrast to traditional data quality management, it is impossible to specify all the data semantics beforehand for big data. We need context-aware data quality rules to detect semantic errors in massive amounts of heterogeneous data generated at high speed. While many researchers tackle the quality issues of big data, they define the data context from a specific standpoint. Although data quality is a longstanding research issue in academia and industry, it remains open, especially with the advent of big data, which has made data quality assessment more challenging than ever.

This article provides a scoping review of existing context-aware data quality assessment solutions, starting with big data quality solutions in general and then covering context-aware solutions. The strengths and weaknesses of these solutions are outlined and discussed. The survey showed that none of the existing data quality assessment solutions could guarantee context awareness together with the ability to handle big data. Notably, each solution dealt with only a partial view of the context. We compared the existing quality models and solutions to reach a comprehensive view covering the aspects of context awareness when assessing data quality. This led us to a set of recommendations, framed in a methodological framework, that shape the design and implementation of any context-aware data quality service for big data. Open challenges are then identified and discussed.


