On the Data Quality and Imbalance in Machine Learning-based Design and Manufacturing—A Systematic Review

Impact factor: 11.6; Zone 1 (Engineering and Technology); JCR Q1 (Engineering, Multidisciplinary)
Journal: Engineering, Volume 45, Pages 105-131
Pub Date: 2025-02-01; Epub Date: 2024-07-02; DOI: 10.1016/j.eng.2024.04.024
Authors: Jiarui Xie, Lijun Sun, Yaoyao Fiona Zhao
URL: https://www.sciencedirect.com/science/article/pii/S2095809924003734
Citations: 0

Abstract

Machine learning (ML) has recently enabled many modeling tasks in design, manufacturing, and condition monitoring due to its unparalleled learning ability using existing data. Data have become the limiting factor when implementing ML in industry. However, there is no systematic investigation of how data quality can be assessed and improved for ML-based design and manufacturing. The aim of this survey is to uncover the data challenges in this domain and review the techniques used to resolve them. To establish the background for the subsequent analysis, crucial data terminologies in ML-based modeling are reviewed and categorized into data acquisition, management, analysis, and utilization. Thereafter, the concepts and frameworks established to evaluate data quality and imbalance, including data quality assessment, data readiness, information quality, data biases, fairness, and diversity, are further investigated. The root causes and types of data challenges, including human factors, complex systems, complicated relationships, lack of data quality, data heterogeneity, data imbalance, and data scarcity, are identified and summarized. Methods to improve data quality and mitigate data imbalance, together with their applications in this domain, are reviewed. This literature review focuses on two promising methods: data augmentation and active learning. The strengths, limitations, and applicability of the surveyed techniques are illustrated. Trends in data augmentation and active learning are discussed with respect to their applications, data types, and approaches. Based on this discussion, future directions for data quality improvement and data imbalance mitigation in this domain are identified.
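As a rough illustration of the two method families named in the abstract, the sketch below shows one very simple form of each on toy data. It is not taken from the review. Random oversampling stands in for data augmentation (it only duplicates minority samples rather than synthesizing new ones), and least-confidence sampling stands in for an active learning query step. The function names, the "ok"/"defect" labels, and the probability values are all hypothetical.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of each smaller class
    until every class is as large as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y


def least_confident(probabilities, k):
    """Active learning query step: indices of the k unlabeled samples
    whose highest predicted class probability is lowest."""
    order = sorted(range(len(probabilities)),
                   key=lambda i: max(probabilities[i]))
    return order[:k]


# Toy data (assumed): five normal parts and one defective part.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
y = ["ok", "ok", "ok", "ok", "ok", "defect"]
X_bal, y_bal = random_oversample(X, y)

# Assumed class probabilities from some model for three unlabeled samples.
probs = [[0.95, 0.05], [0.55, 0.45], [0.70, 0.30]]
query = least_confident(probs, k=1)
```

Here `imbalance_ratio` is one possible way to quantify imbalance before and after resampling, and `query` holds the index of the sample that would be sent for labeling first. Real applications would typically use more elaborate augmentation and query strategies than these.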
Source journal: Engineering
Category: Environmental Science - Environmental Engineering
Self-citation rate: 1.60%
Articles published: 335
Review time: 35 days
About the journal: Engineering, an international open-access journal initiated by the Chinese Academy of Engineering (CAE) in 2015, serves as a distinguished platform for disseminating cutting-edge advancements in engineering R&D, sharing major research outputs, and highlighting key achievements worldwide. The journal's objectives encompass reporting progress in engineering science, fostering discussions on hot topics, addressing areas of interest, challenges, and prospects in engineering development, while considering human and environmental well-being and ethics in engineering. It aims to inspire breakthroughs and innovations with profound economic and social significance, propelling them to advanced international standards and transforming them into a new productive force. Ultimately, this endeavor seeks to bring about positive changes globally, benefit humanity, and shape a new future.