Global information regulation network for multimodal sentiment analysis

IF 4.2 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Image and Vision Computing Pub Date : 2024-11-01 Epub Date: 2024-10-10 DOI:10.1016/j.imavis.2024.105297

Shufan Xie, Qiaohong Chen, Xian Fang, Qi Sun

{"title":"Global information regulation network for multimodal sentiment analysis","authors":"Shufan Xie, Qiaohong Chen, Xian Fang, Qi Sun","doi":"10.1016/j.imavis.2024.105297","DOIUrl":null,"url":null,"abstract":"<div><div>Human language is considered multimodal, containing natural language, visual elements, and acoustic signals. Multimodal Sentiment Analysis (MSA) concentrates on the integration of various modalities to capture the sentiment polarity or intensity expressed in human language. Nevertheless, the absence of a comprehensive strategy for processing and integrating multimodal representations results in the inclusion of inaccurate or noisy data from diverse modalities in the ultimate decision-making process, potentially leading to the neglect of crucial information within or across modalities. To address this issue, we propose the Global Information Regulation Network (GIRN), a novel framework designed to regulate information flow and decision-making processes across various stages, ranging from unimodal feature extraction to multimodal outcome prediction. Specifically, before modal fusion stage, we maximize the mutual information between modalities and refine the input signals through random feature erasing, yielding a more robust unimodal representation. In the process of modal fusion, we enhance the traditional Transformer encoder through the gate mechanism and stacked attention to dynamically fuse the target and auxiliary modalities. After modal fusion, cross-hierarchical contrastive learning and decision gate are employed to integrate the valuable information represented in different categories and hierarchies. Extensive experiments conducted on the CMU-MOSI and CMU-MOSEI datasets suggest that our methodology outperforms existing approaches across nearly all criteria.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105297"},"PeriodicalIF":4.2000,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885624004025","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/10/10 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Human language is considered multimodal, containing natural language, visual elements, and acoustic signals. Multimodal Sentiment Analysis (MSA) concentrates on the integration of various modalities to capture the sentiment polarity or intensity expressed in human language. Nevertheless, the absence of a comprehensive strategy for processing and integrating multimodal representations results in the inclusion of inaccurate or noisy data from diverse modalities in the ultimate decision-making process, potentially leading to the neglect of crucial information within or across modalities. To address this issue, we propose the Global Information Regulation Network (GIRN), a novel framework designed to regulate information flow and decision-making processes across various stages, ranging from unimodal feature extraction to multimodal outcome prediction. Specifically, before modal fusion stage, we maximize the mutual information between modalities and refine the input signals through random feature erasing, yielding a more robust unimodal representation. In the process of modal fusion, we enhance the traditional Transformer encoder through the gate mechanism and stacked attention to dynamically fuse the target and auxiliary modalities. After modal fusion, cross-hierarchical contrastive learning and decision gate are employed to integrate the valuable information represented in different categories and hierarchies. Extensive experiments conducted on the CMU-MOSI and CMU-MOSEI datasets suggest that our methodology outperforms existing approaches across nearly all criteria.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

用于多模态情感分析的全球信息监管网络

人类语言被认为是多模态的，包含自然语言、视觉元素和声音信号。多模态情感分析（MSA）侧重于整合各种模态，以捕捉人类语言中表达的情感极性或强度。然而，由于缺乏处理和整合多模态表征的综合策略，在最终决策过程中会包含来自不同模态的不准确或有噪声的数据，从而可能导致忽略模态内或模态间的关键信息。为了解决这个问题，我们提出了全球信息调节网络（GIRN），这是一个新颖的框架，旨在调节从单模态特征提取到多模态结果预测等各个阶段的信息流和决策过程。具体来说，在模态融合阶段之前，我们会最大化模态之间的互信息，并通过随机特征擦除来完善输入信号，从而获得更稳健的单模态表示。在模态融合过程中，我们通过门机制和叠加注意力来增强传统的 Transformer 编码器，从而实现目标模态和辅助模态的动态融合。模态融合后，我们采用跨层级对比学习和决策门来整合不同类别和层级所代表的有价值信息。在 CMU-MOSI 和 CMU-MOSEI 数据集上进行的大量实验表明，我们的方法在几乎所有标准上都优于现有方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Image and Vision Computing 工程技术-工程：电子与电气

CiteScore

8.50

自引率

8.50%

发文量

143

审稿时长

7.8 months

期刊介绍： Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.