CE-DCVSI: Multimodal relational extraction based on collaborative enhancement of dual-channel visual semantic information

IF 7.5 | Zone 1, Computer Science | JCR Q1, Computer Science, Artificial Intelligence | Expert Systems with Applications | Pub Date: 2024-11-04 | DOI: 10.1016/j.eswa.2024.125608

Article URL: https://www.sciencedirect.com/science/article/pii/S0957417424024758

Citations: 0

Abstract

In multimodal relation extraction (MRE), the visual information carried by images often contains details that are difficult to express in text. Integrating textual and visual information is the mainstream approach to improving the understanding and extraction of relations between entities. However, existing MRE methods neglect the semantic gap caused by data heterogeneity. In addition, some approaches map the relations between target objects in image scene graphs to the text, but the large number of invalid visual relations introduces noise. To alleviate these problems, we propose a novel multimodal relation extraction method based on collaborative enhancement of dual-channel visual semantic information (CE-DCVSI). Specifically, to mitigate the semantic gap between modalities, we achieve fine-grained semantic alignment between entities and target objects through multimodal heterogeneous graphs, using a heterogeneous graph Transformer to align the feature representations of different modalities in the same semantic space, which promotes the consistency and accuracy of the feature representations. To eliminate the effect of useless visual relations, we perform multi-scale feature fusion between different levels of visual information and the textual representations, which increases the complementarity between features and improves the comprehensiveness and robustness of the multimodal representation. Finally, we apply the information bottleneck principle to filter invalid information out of the multimodal representation, mitigating the negative impact of irrelevant noise. Experiments show that the method achieves an F1 score of 86.08% on the publicly available MRE dataset, outperforming the baseline methods.
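
The information bottleneck step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the paper instantiates the principle, so the code below assumes the widely used variational form: a Gaussian encoder over the fused representation, a KL penalty toward a standard normal prior as the compression term, and cross-entropy over relation labels as the prediction term. The function names, the Gaussian encoder, and the weight `beta` are assumptions made for illustration and are not taken from the paper.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for a single sample."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar)
    )


def cross_entropy(logits, label):
    """Negative log-likelihood of `label` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[label]


def vib_loss(mu, logvar, logits, label, beta=0.01):
    """Assumed variational IB objective: task loss + beta * compression.

    mu, logvar : parameters of the (hypothetical) Gaussian encoder over
                 the fused multimodal representation
    logits     : relation-classifier scores computed from a sample of it
    beta       : trade-off weight (hypothetical value, not from the paper)
    """
    return cross_entropy(logits, label) + beta * kl_to_standard_normal(mu, logvar)


if __name__ == "__main__":
    # Toy values only; a real model would produce these from text and image.
    loss = vib_loss(
        mu=[0.3, -0.1],
        logvar=[-0.2, 0.1],
        logits=[2.0, 0.5, -1.0],
        label=0,
        beta=0.01,
    )
    print(f"toy VIB loss: {loss:.4f}")
```

In this form, a larger `beta` pushes the representation toward the prior and discards more input-specific detail, while a smaller `beta` keeps more of it for the relation classifier. Whether CE-DCVSI uses this bound or a different estimator cannot be determined from the abstract.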

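Source journal

Expert Systems with Applications (Engineering & Technology, Engineering: Electrical & Electronic)

CiteScore: 13.80

Self-citation rate: 10.60%

Articles published: 2045

Review time: 8.7 months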

About the journal: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.