CRMSP: A semi-supervised approach for key information extraction with Class-Rebalancing and Merged Semantic Pseudo-Labeling

IF 5.5 | Zone 2, Computer Science | JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Neurocomputing | Pub Date: 2024-11-16 | DOI: 10.1016/j.neucom.2024.128907

Qi Zhang, Yonghong Song, Pengcheng Guo, Yangyang Hui

Neurocomputing, Volume 616, Article 128907. https://www.sciencedirect.com/science/article/pii/S0925231224016783

Citations: 0

Abstract

There is growing demand in Key Information Extraction (KIE) for semi-supervised learning (SSL) to save manpower and cost, because training on document data with fully supervised methods requires labor-intensive manual annotation. The main challenges of applying SSL to KIE are (1) underestimation of the confidence of tail classes in the long-tailed distribution and (2) difficulty in achieving intra-class compactness and inter-class separability of tail features. To address these challenges, we propose a novel semi-supervised approach for KIE with Class-Rebalancing and Merged Semantic Pseudo-Labeling (CRMSP). First, the Class-Rebalancing Pseudo-Labeling (CRP) module introduces a reweighting factor to rebalance pseudo-labels, increasing attention to tail classes. Second, the Merged Semantic Pseudo-Labeling (MSP) module clusters tail features of unlabeled data by assigning samples to Merged Prototypes (MP). Additionally, we design a new contrastive loss specifically for MSP. Extensive experiments on three well-known benchmarks demonstrate that CRMSP achieves state-of-the-art performance. Remarkably, CRMSP achieves a 3.24% F1-score improvement over the state of the art on CORD.
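To make the idea of rebalancing pseudo-labels concrete, here is a minimal sketch of one generic way a reweighting factor can shift attention toward tail classes. It is not the CRP module from the paper: the inverse-frequency formula, the fixed confidence threshold, the `alpha` parameter, the function names, and the class names (`total`, `menu`, `tax`) are all assumptions chosen for illustration.

```python
from collections import Counter


def class_weights(labeled_classes, alpha=1.0):
    """Inverse-frequency weights (an assumed formula, not the paper's).

    The most frequent class gets weight 1.0; rarer (tail) classes get
    proportionally larger weights. alpha controls how strong the effect is.
    """
    counts = Counter(labeled_classes)
    n_max = max(counts.values())
    return {c: (n_max / n) ** alpha for c, n in counts.items()}


def rebalanced_pseudo_labels(probs, weights, threshold=0.9):
    """Keep confident predictions and attach a per-class loss weight.

    probs: list of dicts mapping class name -> predicted probability.
    Returns a list of (sample_index, pseudo_label, loss_weight) tuples.
    """
    selected = []
    for i, p in enumerate(probs):
        label = max(p, key=p.get)  # most probable class
        if p[label] >= threshold:  # hypothetical fixed threshold
            selected.append((i, label, weights[label]))
    return selected


# Hypothetical long-tailed labeled set: "tax" is the tail class.
labeled = ["total"] * 8 + ["menu"] * 4 + ["tax"] * 1
weights = class_weights(labeled)

# Hypothetical model outputs on three unlabeled samples.
probs = [
    {"total": 0.95, "menu": 0.03, "tax": 0.02},
    {"total": 0.05, "menu": 0.03, "tax": 0.92},
    {"total": 0.50, "menu": 0.30, "tax": 0.20},
]
selected = rebalanced_pseudo_labels(probs, weights)
print(weights)
print(selected)
```

In this toy setup a confident pseudo-label for the tail class contributes more to the unsupervised loss than one for the head class, while the low-confidence third sample is discarded. How CRMSP actually computes its reweighting factor is described in the full paper, not in this abstract.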

Source journal

Neurocomputing (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 13.10
Self-citation rate: 10.00%
Articles published: 1382
Review time: 70 days

Journal description: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.