CE-DCVSI: Multimodal relational extraction based on collaborative enhancement of dual-channel visual semantic information

IF 7.5 1区 计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Expert Systems with Applications Pub Date : 2024-11-04 DOI:10.1016/j.eswa.2024.125608
{"title":"CE-DCVSI: Multimodal relational extraction based on collaborative enhancement of dual-channel visual semantic information","authors":"","doi":"10.1016/j.eswa.2024.125608","DOIUrl":null,"url":null,"abstract":"<div><div>Visual information implied by the images in multimodal relation extraction (MRE) usually contains details that are difficult to describe in text sentences. Integrating textual and visual information is the mainstream method to enhance the understanding and extraction of relations between entities. However, existing MRE methods neglect the semantic gap caused by data heterogeneity. Besides, some approaches map the relations between target objects in image scene graphs to text, but massive invalid visual relations introduce noise. To alleviate the above problems, we propose a novel multimodal relation extraction method based on cooperative enhancement of dual-channel visual semantic information (CE-DCVSI). Specifically, to mitigate the semantic gap between modalities, we realize fine-grained semantic alignment between entities and target objects through multimodal heterogeneous graphs, aligning feature representations of different modalities into the same semantic space using the heterogeneous graph Transformer, thus promoting the consistency and accuracy of feature representations. To eliminate the effect of useless visual relations, we perform multi-scale feature fusion between different levels of visual information and textual representations to increase the complementarity between features, improving the comprehensiveness and robustness of the multimodal representation. Finally, we utilize the information bottleneck principle to filter out invalid information from the multimodal representation to mitigate the negative impact of irrelevant noise. The experiments demonstrate that the method achieves 86.08% of the F1 score on the publicly available MRE dataset, which outperforms other baseline methods.</div></div>","PeriodicalId":50461,"journal":{"name":"Expert Systems with Applications","volume":null,"pages":null},"PeriodicalIF":7.5000,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Expert Systems with Applications","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0957417424024758","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Visual information implied by the images in multimodal relation extraction (MRE) usually contains details that are difficult to describe in text sentences. Integrating textual and visual information is the mainstream method to enhance the understanding and extraction of relations between entities. However, existing MRE methods neglect the semantic gap caused by data heterogeneity. Besides, some approaches map the relations between target objects in image scene graphs to text, but massive invalid visual relations introduce noise. To alleviate the above problems, we propose a novel multimodal relation extraction method based on cooperative enhancement of dual-channel visual semantic information (CE-DCVSI). Specifically, to mitigate the semantic gap between modalities, we realize fine-grained semantic alignment between entities and target objects through multimodal heterogeneous graphs, aligning feature representations of different modalities into the same semantic space using the heterogeneous graph Transformer, thus promoting the consistency and accuracy of feature representations. To eliminate the effect of useless visual relations, we perform multi-scale feature fusion between different levels of visual information and textual representations to increase the complementarity between features, improving the comprehensiveness and robustness of the multimodal representation. Finally, we utilize the information bottleneck principle to filter out invalid information from the multimodal representation to mitigate the negative impact of irrelevant noise. The experiments demonstrate that the method achieves 86.08% of the F1 score on the publicly available MRE dataset, which outperforms other baseline methods.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
CE-DCVSI:基于双通道视觉语义信息协同增强的多模态关系提取
在多模态关系提取(MRE)中,图像所隐含的视觉信息通常包含难以用文本句子描述的细节。整合文本和视觉信息是增强实体间关系理解和提取的主流方法。然而,现有的 MRE 方法忽视了数据异质性造成的语义差距。此外,有些方法将图像场景图中目标对象之间的关系映射到文本中,但大量无效的视觉关系会带来噪声。为了解决上述问题,我们提出了一种基于双通道视觉语义信息协同增强(CE-DCVSI)的新型多模态关系提取方法。具体来说,为了缓解模态之间的语义差距,我们通过多模态异构图实现了实体与目标对象之间的细粒度语义对齐,利用异构图变换器将不同模态的特征表征对齐到同一语义空间,从而提高了特征表征的一致性和准确性。为了消除无用视觉关系的影响,我们在不同层次的视觉信息和文本表征之间进行多尺度特征融合,以增加特征之间的互补性,提高多模态表征的全面性和鲁棒性。最后,我们利用信息瓶颈原理过滤掉多模态表征中的无效信息,以减轻无关噪声的负面影响。实验证明,该方法在公开的 MRE 数据集上获得了 86.08% 的 F1 分数,优于其他基线方法。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Expert Systems with Applications
Expert Systems with Applications 工程技术-工程:电子与电气
CiteScore
13.80
自引率
10.60%
发文量
2045
审稿时长
8.7 months
期刊介绍: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.
期刊最新文献
CE-DCVSI: Multimodal relational extraction based on collaborative enhancement of dual-channel visual semantic information Learning face super-resolution through identity features and distilling facial prior knowledge Identification of gene regulatory networks associated with breast cancer patient survival using an interpretable deep neural network model A high-effective swarm intelligence-based multi-robot cooperation method for target searching in unknown hazardous environments A new look of dispatching for multi-objective interbay AMHS in semiconductor wafer manufacturing: A T–S fuzzy-based learning approach
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1