Data Validation Utilizing Expert Knowledge and Shape Constraints

F. Bachinger, Lisa Ehrlinger, G. Kronberger, Wolfram Wöß
Journal of Data and Information Quality. Published 2024-05-11. DOI: 10.1145/3661826 (https://doi.org/10.1145/3661826)

Abstract

Data validation is a primary concern in any data-driven application, as undetected data errors may negatively affect machine learning models and lead to suboptimal decisions. Data quality issues are usually detected manually by experts, which becomes infeasible and uneconomical for large volumes of data. To enable automated data validation, we propose “shape constraint-based data validation”, a novel approach based on machine learning that incorporates expert knowledge in the form of shape constraints. Shape constraints can be used to describe expected (multivariate and nonlinear) patterns in valid data, and enable the detection of invalid data which deviates from these expected patterns. Our approach can be divided into two steps: (1) shape-constrained prediction models are trained on data, and (2) their training error is analyzed to identify invalid data. The training error can be used as an indicator for invalid data because shape-constrained models can fit valid data better than invalid data. We evaluate the approach on a benchmark suite consisting of synthetic datasets, which we have published for benchmarking similar data validation approaches. Additionally, we demonstrate the capabilities of the proposed approach with a real-world dataset consisting of measurements from a friction test bench in an industrial setting. Our approach detects subtle data errors that are difficult to identify even for domain experts.
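
The two steps described in the abstract can be illustrated with a minimal sketch. The abstract does not name a model class, so this example substitutes one-dimensional isotonic regression (the pool-adjacent-violators algorithm) as a stand-in for a shape-constrained model; the only shape constraint is "the output is non-decreasing in the input". The function names, the toy measurements, and the error threshold of 0.5 are all assumptions made for illustration and are not taken from the paper.

```python
import math


def fit_monotone_increasing(ys):
    """Step 1: least-squares fit under a non-decreasing shape constraint.

    Pool-adjacent-violators: neighbouring values that violate the
    constraint are merged into a block and replaced by the block mean.
    `ys` must already be ordered by the input variable.
    """
    blocks = []  # each block is [sum of values, number of values]
    for y in ys:
        blocks.append([y, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


def training_rmse(ys, fitted):
    """Step 2: root-mean-square training error of the constrained fit."""
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(ys, fitted)) / len(ys))


def validate(xs, ys, threshold):
    """Return (is_valid, error); data is flagged when the error exceeds threshold."""
    ordered = [y for _, y in sorted(zip(xs, ys))]
    fitted = fit_monotone_increasing(ordered)
    error = training_rmse(ordered, fitted)
    return error <= threshold, error


# Hypothetical measurements: the expert expects y to increase with x.
xs = list(range(10))
valid_ys = [0.0, 1.0, 2.1, 2.0, 4.0, 5.2, 6.0, 6.9, 8.1, 9.0]    # small noise only
invalid_ys = [0.0, 1.0, 2.1, 2.9, 4.0, 1.0, 0.5, 0.2, 8.1, 9.0]  # simulated sensor fault

THRESHOLD = 0.5  # assumed value; in practice it would be tuned per dataset
valid_ok, valid_error = validate(xs, valid_ys, THRESHOLD)
invalid_ok, invalid_error = validate(xs, invalid_ys, THRESHOLD)
print(valid_ok, invalid_ok)
```

The constrained model can absorb the small noise in the first series almost perfectly, but it cannot follow the dip in the second series without violating monotonicity, so its training error rises. This is the effect the abstract relies on when it uses training error as an indicator of invalid data.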