TMFN: a text-based multimodal fusion network with multi-scale feature extraction and unsupervised contrastive learning for multimodal sentiment analysis
{"title":"TMFN: a text-based multimodal fusion network with multi-scale feature extraction and unsupervised contrastive learning for multimodal sentiment analysis","authors":"Junsong Fu, Youjia Fu, Huixia Xue, Zihao Xu","doi":"10.1007/s40747-024-01724-5","DOIUrl":null,"url":null,"abstract":"<p>Multimodal sentiment analysis (MSA) is crucial in human-computer interaction. Current methods use simple sub-models for feature extraction, neglecting multi-scale features and the complexity of emotions. Text, visual, and audio each have unique characteristics in MSA, with text often providing more emotional cues due to its rich semantics. However, current approaches treat modalities equally, not maximizing text’s advantages. To solve these problems, we propose a novel method named a text-based multimodal fusion network with multi-scale feature extraction and unsupervised contrastive learning (TMFN). Firstly, we propose an innovative pyramid-structured multi-scale feature extraction method, which captures the multi-scale features of modal data through convolution kernels of different sizes and strengthens key features through channel attention mechanism. Second, we design a text-based multimodal feature fusion module, which consists of a text gating unit (TGU) and a text-based channel-wise attention transformer (TCAT). TGU is responsible for guiding and regulating the fusion process of other modal information, while TCAT improves the model’s ability to capture the relationship between features of different modalities and achieves effective feature interaction. Finally, to further optimize the representation of fused features, we introduce unsupervised contrastive learning to deeply explore the intrinsic connection between multi-scale features and fused features. Experimental results show that our proposed model outperforms the state-of-the-art models in MSA on two benchmark datasets.</p>","PeriodicalId":10524,"journal":{"name":"Complex & Intelligent Systems","volume":"37 1","pages":""},"PeriodicalIF":5.0000,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Complex & Intelligent Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s40747-024-01724-5","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract
Multimodal sentiment analysis (MSA) is crucial in human-computer interaction. Current methods use simple sub-models for feature extraction, neglecting multi-scale features and the complexity of emotions. Text, visual, and audio modalities each have unique characteristics in MSA, and text often provides more emotional cues owing to its rich semantics. However, current approaches treat the modalities equally and fail to exploit the advantages of text. To solve these problems, we propose a text-based multimodal fusion network with multi-scale feature extraction and unsupervised contrastive learning (TMFN). First, we propose a pyramid-structured multi-scale feature extraction method, which captures the multi-scale features of modal data through convolution kernels of different sizes and strengthens key features through a channel attention mechanism. Second, we design a text-based multimodal feature fusion module, which consists of a text gating unit (TGU) and a text-based channel-wise attention transformer (TCAT). TGU guides and regulates the fusion of information from the other modalities, while TCAT improves the model's ability to capture the relationships between features of different modalities and achieves effective feature interaction. Finally, to further optimize the representation of the fused features, we introduce unsupervised contrastive learning to explore the intrinsic connection between the multi-scale features and the fused features. Experimental results show that our proposed model outperforms state-of-the-art MSA models on two benchmark datasets.
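The abstract does not give implementation details, but the pyramid-structured multi-scale extraction it describes can be pictured as parallel 1-D convolutions with different kernel sizes over a modality's feature sequence, re-weighted by an SE-style channel attention. The following is a minimal sketch under that assumption; all layer choices, sizes, and names (e.g. `MultiScaleExtractor`, the SE-style gating, the BERT-sized input) are illustrative and not taken from the paper.

```python
# Illustrative sketch only: parallel multi-scale Conv1d branches plus an
# SE-style channel attention, approximating the pyramid-structured
# multi-scale feature extraction described in the abstract.
import torch
import torch.nn as nn


class MultiScaleExtractor(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, kernel_sizes=(1, 3, 5), reduction: int = 4):
        super().__init__()
        # One convolution branch per kernel size; each keeps the sequence length.
        self.branches = nn.ModuleList(
            nn.Conv1d(in_dim, out_dim, k, padding=k // 2) for k in kernel_sizes
        )
        fused = out_dim * len(kernel_sizes)
        # Channel attention over the concatenated multi-scale channels (assumed SE-style).
        self.channel_attn = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(fused, fused // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(fused // reduction, fused),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, in_dim); Conv1d expects (batch, channels, seq_len).
        x = x.transpose(1, 2)
        feats = torch.cat([branch(x) for branch in self.branches], dim=1)
        weights = self.channel_attn(feats).unsqueeze(-1)   # (batch, fused, 1)
        return (feats * weights).transpose(1, 2)           # (batch, seq_len, fused)


if __name__ == "__main__":
    text_feats = torch.randn(8, 50, 768)                   # e.g. token features from a text encoder
    extractor = MultiScaleExtractor(in_dim=768, out_dim=128)
    print(extractor(text_feats).shape)                     # torch.Size([8, 50, 384])
```

In this sketch the multi-scale channels are kept concatenated so a downstream fusion module (such as the paper's TGU/TCAT) could consume them; the actual TMFN fusion and contrastive-learning components are not reproduced here.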
About the Journal
Complex & Intelligent Systems aims to provide a forum for presenting and discussing novel approaches, tools, and techniques that foster cross-fertilization between the broad fields of complex systems, computational simulation, and intelligent analytics and visualization. The transdisciplinary research that the journal focuses on will expand the boundaries of our understanding by investigating the principles and processes that underlie many of the most profound problems facing society today.