BIGQA:声明式大数据质量评估

IF 1.5 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS ACM Journal of Data and Information Quality Pub Date : 2023-06-13 DOI:10.1145/3603706

Hadi Fadlallah, R. Kilany, Houssein Dhayne, Rami El Haddad, R. Haque, Y. Taher, Ali Jaber

{"title":"BIGQA:声明式大数据质量评估","authors":"Hadi Fadlallah, R. Kilany, Houssein Dhayne, Rami El Haddad, R. Haque, Y. Taher, Ali Jaber","doi":"10.1145/3603706","DOIUrl":null,"url":null,"abstract":"In the big data domain, data quality assessment operations are often complex and must be implementable in a distributed and timely manner. This article tries to generalize the quality assessment operations by providing a new ISO-based declarative data quality assessment framework (BIGQA). BIGQA is a flexible solution that supports data quality assessment in different domains and contexts. It facilitates the planning and execution of big data quality assessment operations for data domain experts and data management specialists at any phase in the data life cycle. This work implements BIGQA to demonstrate its ability to produce customized data quality reports while running efficiently on parallel or distributed computing frameworks. BIGQA generates data quality assessment plans using straightforward operators designed to handle big data and guarantee a high degree of parallelism when executed. Moreover, it allows incremental data quality assessment to avoid reading the whole dataset each time the quality assessment operation is required. The result was validated using radiation wireless sensor data and Stack Overflow users’ data to show that it can be implemented within different contexts. The experiments show a 71% performance improvement over a 1 GB flat file on a single processing machine compared with a non-parallel application and a 75% performance improvement over a 25 GB flat file within a distributed environment compared to a non-distributed application.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"56 1","pages":"1 - 30"},"PeriodicalIF":1.5000,"publicationDate":"2023-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"BIGQA: Declarative Big Data Quality Assessment\",\"authors\":\"Hadi Fadlallah, R. Kilany, Houssein Dhayne, Rami El Haddad, R. Haque, Y. Taher, Ali Jaber\",\"doi\":\"10.1145/3603706\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In the big data domain, data quality assessment operations are often complex and must be implementable in a distributed and timely manner. This article tries to generalize the quality assessment operations by providing a new ISO-based declarative data quality assessment framework (BIGQA). BIGQA is a flexible solution that supports data quality assessment in different domains and contexts. It facilitates the planning and execution of big data quality assessment operations for data domain experts and data management specialists at any phase in the data life cycle. This work implements BIGQA to demonstrate its ability to produce customized data quality reports while running efficiently on parallel or distributed computing frameworks. BIGQA generates data quality assessment plans using straightforward operators designed to handle big data and guarantee a high degree of parallelism when executed. Moreover, it allows incremental data quality assessment to avoid reading the whole dataset each time the quality assessment operation is required. The result was validated using radiation wireless sensor data and Stack Overflow users’ data to show that it can be implemented within different contexts. The experiments show a 71% performance improvement over a 1 GB flat file on a single processing machine compared with a non-parallel application and a 75% performance improvement over a 25 GB flat file within a distributed environment compared to a non-distributed application.\",\"PeriodicalId\":44355,\"journal\":{\"name\":\"ACM Journal of Data and Information Quality\",\"volume\":\"56 1\",\"pages\":\"1 - 30\"},\"PeriodicalIF\":1.5000,\"publicationDate\":\"2023-06-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Journal of Data and Information Quality\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3603706\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Journal of Data and Information Quality","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3603706","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 1

摘要

在大数据领域，数据质量评估操作往往非常复杂，必须以分布式和及时的方式实施。本文试图通过提供一个新的基于iso的声明性数据质量评估框架(BIGQA)来推广质量评估操作。BIGQA是一个灵活的解决方案，支持不同领域和上下文的数据质量评估。它有助于数据领域专家和数据管理专家在数据生命周期的任何阶段规划和执行大数据质量评估操作。这项工作实现了BIGQA，以展示其在并行或分布式计算框架上高效运行时生成定制数据质量报告的能力。BIGQA使用简单的运算符生成数据质量评估计划，用于处理大数据，并保证执行时的高度并行性。此外，它允许增量数据质量评估，避免每次需要进行质量评估操作时读取整个数据集。使用辐射无线传感器数据和Stack Overflow用户数据验证了结果，表明它可以在不同的环境中实现。实验表明，与非并行应用程序相比，在单个处理机器上，1 GB平面文件的性能提高了71%，在分布式环境中，与非分布式应用程序相比，25 GB平面文件的性能提高了75%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

BIGQA: Declarative Big Data Quality Assessment

In the big data domain, data quality assessment operations are often complex and must be implementable in a distributed and timely manner. This article tries to generalize the quality assessment operations by providing a new ISO-based declarative data quality assessment framework (BIGQA). BIGQA is a flexible solution that supports data quality assessment in different domains and contexts. It facilitates the planning and execution of big data quality assessment operations for data domain experts and data management specialists at any phase in the data life cycle. This work implements BIGQA to demonstrate its ability to produce customized data quality reports while running efficiently on parallel or distributed computing frameworks. BIGQA generates data quality assessment plans using straightforward operators designed to handle big data and guarantee a high degree of parallelism when executed. Moreover, it allows incremental data quality assessment to avoid reading the whole dataset each time the quality assessment operation is required. The result was validated using radiation wireless sensor data and Stack Overflow users’ data to show that it can be implemented within different contexts. The experiments show a 71% performance improvement over a 1 GB flat file on a single processing machine compared with a non-parallel application and a 75% performance improvement over a 25 GB flat file within a distributed environment compared to a non-distributed application.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊