DQSOps: Data Quality Scoring Operations Framework for Data-Driven Applications

Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering. Pub Date: 2023-03-27. DOI: 10.1145/3593434.3593445

Firas Bayram, Bestoun S. Ahmed, Erik Hallin, Anton Engman

Citations: 1

Abstract

Data quality assessment has become a prominent component in the successful execution of complex data-driven artificial intelligence (AI) software systems. In practice, real-world applications generate huge volumes of data at high speed. These data streams require analysis and preprocessing before being permanently stored or used in a learning task. Significant attention has therefore been paid to the systematic management and construction of high-quality datasets. Nevertheless, voluminous, high-velocity data streams are usually managed manually (i.e., offline), which is impractical in production environments. To address this challenge, DataOps has emerged to achieve life-cycle automation of data processes using DevOps principles. However, determining data quality on a fitness scale remains a complex task within the DataOps framework. This paper presents a novel Data Quality Scoring Operations (DQSOps) framework that yields a quality score for production data in DataOps workflows. The framework incorporates two scoring approaches: an ML prediction-based approach that predicts the data quality score, and a standard-based approach that periodically produces the ground-truth scores by assessing several data quality dimensions. We deploy the DQSOps framework in a real-world industrial use case. The results show that DQSOps achieves significant computational speedup rates compared to the conventional approach of data quality scoring while maintaining high prediction performance.
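
The interplay of the two scoring approaches can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the quality dimensions, how they are aggregated, or which ML model is used, so the two dimensions (`completeness`, `uniqueness`), the unweighted mean, the `HybridScorer` class, its `period` parameter, and the constant stand-in predictor are all hypothetical choices made for illustration.

```python
from statistics import mean


def completeness(window):
    """Fraction of non-missing cells in the window (assumed dimension)."""
    cells = [value for row in window for value in row]
    return sum(value is not None for value in cells) / len(cells)


def uniqueness(window):
    """Fraction of distinct rows in the window (assumed dimension)."""
    return len({tuple(row) for row in window}) / len(window)


def standard_score(window):
    """Standard-based score: unweighted mean of the dimension scores.

    The aggregation rule is an assumption; the abstract does not specify it.
    """
    return mean([completeness(window), uniqueness(window)])


class HybridScorer:
    """Return a cheap predicted score for most windows and recompute the
    standard-based ground truth on every `period`-th window."""

    def __init__(self, predict, period):
        self.predict = predict  # hypothetical stand-in for the ML model
        self.period = period    # hypothetical ground-truth refresh interval
        self.seen = 0
        self.history = []       # (window, ground-truth score) pairs

    def score(self, window):
        self.seen += 1
        if self.seen % self.period == 0:
            truth = standard_score(window)
            # Stored pairs could later be used to retrain the predictor.
            self.history.append((window, truth))
            return truth, "standard"
        return self.predict(window), "predicted"


# Usage: a constant predictor stands in for a trained model.
window = [(1, 2), (1, 2), (3, None), (4, 5)]
scorer = HybridScorer(predict=lambda w: 0.5, period=2)
first = scorer.score(window)   # served by the predictor
second = scorer.score(window)  # served by the standard-based scorer
```

In this sketch the expensive per-dimension assessment runs only on every second window, while the remaining windows are scored by the predictor; the ground-truth pairs kept in `history` show where retraining data for such a predictor could come from.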
