Risk-based data validation in machine learning-based software systems

Harald Foidl, M. Felderer
{"title":"Risk-based data validation in machine learning-based software systems","authors":"Harald Foidl, M. Felderer","doi":"10.1145/3340482.3342743","DOIUrl":null,"url":null,"abstract":"Data validation is an essential requirement to ensure the reliability and quality of Machine Learning-based Software Systems. However, an exhaustive validation of all data fed to these systems (i.e. up to several thousand features) is practically unfeasible. In addition, there has been little discussion about methods that support software engineers of such systems in determining how thorough to validate each feature (i.e. data validation rigor). Therefore, this paper presents a conceptual data validation approach that prioritizes features based on their estimated risk of poor data quality. The risk of poor data quality is determined by the probability that a feature is of low data quality and the impact of this low (data) quality feature on the result of the machine learning model. Three criteria are presented to estimate the probability of low data quality (Data Source Quality, Data Smells, Data Pipeline Quality). To determine the impact of low (data) quality features, the importance of features according to the performance of the machine learning model (i.e. Feature Importance) is utilized. The presented approach provides decision support (i.e. data validation prioritization and rigor) for software engineers during the implementation of data validation techniques in the course of deploying a trained machine learning model and its software stack.","PeriodicalId":254040,"journal":{"name":"Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation","volume":"97 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"24","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3340482.3342743","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 24

Abstract

Data validation is an essential requirement to ensure the reliability and quality of Machine Learning-based Software Systems. However, an exhaustive validation of all data fed to these systems (i.e. up to several thousand features) is practically unfeasible. In addition, there has been little discussion about methods that support software engineers of such systems in determining how thorough to validate each feature (i.e. data validation rigor). Therefore, this paper presents a conceptual data validation approach that prioritizes features based on their estimated risk of poor data quality. The risk of poor data quality is determined by the probability that a feature is of low data quality and the impact of this low (data) quality feature on the result of the machine learning model. Three criteria are presented to estimate the probability of low data quality (Data Source Quality, Data Smells, Data Pipeline Quality). To determine the impact of low (data) quality features, the importance of features according to the performance of the machine learning model (i.e. Feature Importance) is utilized. The presented approach provides decision support (i.e. data validation prioritization and rigor) for software engineers during the implementation of data validation techniques in the course of deploying a trained machine learning model and its software stack.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
基于机器学习的软件系统中基于风险的数据验证
数据验证是保证基于机器学习的软件系统可靠性和质量的基本要求。然而,对提供给这些系统的所有数据(即多达数千个特征)进行详尽的验证实际上是不可行的。此外,关于支持此类系统的软件工程师确定如何彻底验证每个特性(即数据验证的严谨性)的方法的讨论很少。因此,本文提出了一种概念性数据验证方法,该方法根据数据质量差的估计风险对特征进行优先级排序。数据质量差的风险是由一个特征是低数据质量的概率和这个低(数据)质量特征对机器学习模型结果的影响决定的。提出了三个评估低数据质量概率的标准(数据源质量、数据气味、数据管道质量)。为了确定低(数据)质量特征的影响,根据机器学习模型的性能使用特征的重要性(即特征重要性)。提出的方法为软件工程师在部署训练有素的机器学习模型及其软件堆栈的过程中实现数据验证技术提供决策支持(即数据验证优先级和严谨性)。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Leveraging mutants for automatic prediction of metamorphic relations using machine learning On the role of data balancing for machine learning-based code smell detection Classifying non-functional requirements using RNN variants for quality software development A machine learning based automatic folding of dynamically typed languages Risk-based data validation in machine learning-based software systems
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1