Leveraging local and global relationships for corrupted label detection

Future Generation Computer Systems-The International Journal of Escience | IF 6.2 | CAS Zone 2 (Computer Science) | JCR Q1 (Computer Science, Theory & Methods) | Pub Date: 2025-05-01 | Epub Date: 2025-01-24 | DOI: 10.1016/j.future.2025.107729
Phong Lam, Ha-Linh Nguyen, Xuan-Truc Dao Dang, Van-Son Tran, Minh-Duc Le, Thu-Trang Nguyen, Son Nguyen, Hieu Dinh Vo
Volume 166, Article 107729 | Journal Article | https://www.sciencedirect.com/science/article/pii/S0167739X2500024X

Abstract

The performance of machine learning and deep learning models depends heavily on the quality and quantity of the training data. However, real-world datasets often contain a considerable percentage of noisy labels, ranging from 8.0% to 38.5%, which can significantly reduce model accuracy. To address the problem of corrupted labels, we propose Cola, a novel data-centric approach that leverages both local neighborhood similarities and global relationships across the entire dataset to detect corrupted labels. The main idea of our approach is that similar instances tend to share the same label, and that the relationships among clean data can be learned and used to distinguish correct labels from corrupted ones. Our experiments on four well-established image and text datasets demonstrate that Cola consistently outperforms state-of-the-art approaches, achieving improvements of 8% to 21% in F1-score for identifying corrupted labels across various noise types and rates. For visual data, Cola achieves improvements of up to 80% in F1-score, while for textual data, the average improvement reaches about 17% with a maximum of 91%. Furthermore, Cola is significantly more effective and efficient at detecting corrupted labels than advanced large language models such as Llama3, with improvements of up to 112% in Precision and a 300x reduction in execution time.
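
The local intuition stated in the abstract (similar instances tend to share the same label) can be illustrated with a small sketch. This is not Cola itself: the abstract does not specify how the local and global components are built, so the code below only shows a plain k-nearest-neighbor disagreement check plus the precision, recall, and F1 computation used to score a detector. The function names `flag_suspect_labels` and `precision_recall_f1`, the value of `k`, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def flag_suspect_labels(points, labels, k=3):
    """Return indices whose label differs from the majority label of
    their k nearest neighbors (hypothetical helper, not the paper's method)."""
    suspects = []
    for i, p in enumerate(points):
        # Brute-force distances to every other point; fine for a toy set.
        neighbors = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        votes = Counter(labels[j] for _, j in neighbors)
        majority_label, _ = votes.most_common(1)[0]
        if majority_label != labels[i]:
            suspects.append(i)
    return suspects


def precision_recall_f1(flagged, truly_corrupted):
    """Score a set of flagged indices against the known corrupted indices."""
    flagged, truly_corrupted = set(flagged), set(truly_corrupted)
    tp = len(flagged & truly_corrupted)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(truly_corrupted) if truly_corrupted else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (assumed): two well-separated clusters of four points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
# Index 3 sits in the first cluster but is deliberately given label "b".
labels = ["a", "a", "a", "b", "b", "b", "b", "b"]

suspects = flag_suspect_labels(points, labels, k=3)
scores = precision_recall_f1(suspects, truly_corrupted=[3])
```

A purely local check like this breaks down when noisy labels cluster together, which is presumably one reason the abstract pairs the local signal with relationships learned across the whole dataset.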
Source journal
CiteScore: 19.90
Self-citation rate: 2.70%
Articles published: 376
Review time: 10.6 months
About the journal: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.