Boosting semi-supervised learning under imbalanced regression via pseudo-labeling

IF 1.5 | Zone 4, Computer Science | Q3 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Concurrency and Computation-Practice & Experience | Pub Date: 2024-06-30 | DOI: 10.1002/cpe.8103

Nannan Zong, Songzhi Su, Changle Zhou

Journal: Concurrency and Computation-Practice & Experience, volume 36, issue 19. Publication type: Journal Article. Open access: no. URL: https://onlinelibrary.wiley.com/doi/10.1002/cpe.8103

Citation count: 0

Abstract

Imbalanced samples are widespread and impair the generalization and fairness of models. Semi-supervised learning can compensate for the scarcity of labeled samples, but selecting high-quality pseudo-labeled data is challenging. Unlike discrete labels, which can be matched one-to-one with points on a numerical axis, labels in regression tasks are continuous and cannot be selected directly. In addition, the unlabeled data are themselves imbalanced, which easily produces an imbalanced distribution of pseudo-labeled data and exacerbates the imbalance in the semi-supervised dataset. To address these problems, this article proposes a semi-supervised imbalanced regression network (SIRN) consisting of two components: A, designed to learn the relationship between features and labels (targets), and B, dedicated to learning the relationship between features and target deviations. A target deviation function is introduced to measure target deviations under an imbalanced distribution, and a deviation matching strategy is designed to select continuous pseudo-labels. Furthermore, an adaptive selection function is developed to mitigate the risk of skewed distributions caused by imbalanced pseudo-labeled data. Finally, the proposed method is evaluated on two regression tasks. The results show a large reduction in prediction error, particularly in few-shot regions, which supports the method's effectiveness in addressing imbalanced samples in regression tasks.
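The abstract does not define the target deviation function, the deviation matching strategy, or the adaptive selection function, so the following is only a minimal sketch of the general idea and not the SIRN procedure. It assumes that a first model supplies target predictions for unlabeled samples, that a second model supplies a predicted absolute deviation for each of them, and that selection keeps the lowest-deviation samples per target bin while giving rarer bins a larger quota. The function name `select_pseudo_labels`, the equal-width binning, and the inverse-frequency quota are all hypothetical choices made for illustration.

```python
import numpy as np


def select_pseudo_labels(pred_target, pred_deviation, labeled_targets,
                         n_bins=5, base_quota=4):
    """Illustrative pseudo-label selection for imbalanced regression.

    This is an assumed rule, not the rule from the paper.
    pred_target:     predicted targets for the unlabeled samples
    pred_deviation:  predicted absolute deviation for the same samples
    labeled_targets: targets of the labeled set, used to measure imbalance
    Returns the sorted indices of unlabeled samples kept as pseudo-labeled data.
    """
    pred_target = np.asarray(pred_target, dtype=float)
    pred_deviation = np.asarray(pred_deviation, dtype=float)
    labeled_targets = np.asarray(labeled_targets, dtype=float)

    # Equal-width bins over the labeled target range (assumption).
    edges = np.linspace(labeled_targets.min(), labeled_targets.max(), n_bins + 1)
    interior = edges[1:-1]  # using interior edges keeps bin ids in 0..n_bins-1

    labeled_bins = np.digitize(labeled_targets, interior)
    unlabeled_bins = np.digitize(pred_target, interior)
    counts = np.bincount(labeled_bins, minlength=n_bins)

    # Inverse-frequency quota: bins with few labeled samples may admit more
    # pseudo-labeled samples (assumption standing in for adaptive selection).
    quota = np.ceil(base_quota * counts.max() / np.maximum(counts, 1)).astype(int)

    selected = []
    for b in range(n_bins):
        idx = np.flatnonzero(unlabeled_bins == b)
        # Within a bin, prefer samples whose predicted deviation is smallest.
        ranked = idx[np.argsort(pred_deviation[idx])]
        selected.extend(ranked[:quota[b]].tolist())
    return np.array(sorted(selected), dtype=int)
```

With nine labeled targets near zero and one at 10, two bins, and `base_quota=1`, the dense bin admits a single pseudo-labeled sample while the sparse bin admits up to nine, so the selected set leans toward the under-represented region instead of reinforcing the existing skew.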


Source journal

Concurrency and Computation-Practice & Experience (Engineering Technology, Computer Science: Theory & Methods)

CiteScore: 5.00

Self-citation rate: 10.00%

Articles published: 664

Review time: 9.6 months

About the journal: Concurrency and Computation: Practice and Experience (CCPE) publishes high-quality, original research papers and authoritative research review papers in the overlapping fields of: Parallel and distributed computing; High-performance computing; Computational and data science; Artificial intelligence and machine learning; Big data applications, algorithms, and systems; Network science; Ontologies and semantics; Security and privacy; Cloud/edge/fog computing; Green computing; and Quantum computing.
