TMFN: a text-based multimodal fusion network with multi-scale feature extraction and unsupervised contrastive learning for multimodal sentiment analysis

IF 5 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Complex & Intelligent Systems | Pub Date: 2025-01-07 | DOI: 10.1007/s40747-024-01724-5
Junsong Fu, Youjia Fu, Huixia Xue, Zihao Xu
Citations: 0

Abstract

Multimodal sentiment analysis (MSA) is crucial in human-computer interaction. Current methods use simple sub-models for feature extraction, neglecting multi-scale features and the complexity of emotions. In MSA, the text, visual, and audio modalities each have distinct characteristics, and text often provides more emotional cues because of its rich semantics. However, current approaches treat all modalities equally and so fail to exploit the advantages of text. To address these problems, we propose a text-based multimodal fusion network with multi-scale feature extraction and unsupervised contrastive learning (TMFN). First, we propose a pyramid-structured multi-scale feature extraction method, which captures the multi-scale features of each modality's data through convolution kernels of different sizes and strengthens key features through a channel attention mechanism. Second, we design a text-based multimodal feature fusion module, which consists of a text gating unit (TGU) and a text-based channel-wise attention transformer (TCAT). The TGU guides and regulates the fusion of information from the other modalities, while the TCAT improves the model’s ability to capture relationships between features of different modalities and achieves effective feature interaction. Finally, to further optimize the representation of the fused features, we introduce unsupervised contrastive learning to explore the intrinsic connection between the multi-scale features and the fused features. Experimental results show that the proposed model outperforms state-of-the-art MSA models on two benchmark datasets.
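
The abstract does not specify which contrastive objective is used, so the following is only an illustrative sketch, not the authors' implementation. It assumes an InfoNCE-style loss in which the multi-scale feature vector of sample i is pulled toward the fused feature vector of the same sample, while the fused vectors of the other samples in the batch act as negatives. The function names (`cosine`, `info_nce`) and the `temperature` parameter are hypothetical and chosen for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # small epsilon avoids division by zero


def info_nce(multi_scale, fused, temperature=0.1):
    """Mean InfoNCE loss over a batch (assumed objective, not from the paper).

    multi_scale[i] and fused[i] are treated as a positive pair;
    fused[j] for j != i serve as in-batch negatives.
    """
    losses = []
    for i, anchor in enumerate(multi_scale):
        logits = [cosine(anchor, f) / temperature for f in fused]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching fused vector.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

With this objective, a batch whose multi-scale and fused vectors are aligned pairwise yields a lower loss than the same batch with the pairing shuffled, which is the behaviour a contrastive term of this kind is meant to encourage.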

Source journal
Complex & Intelligent Systems (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)
CiteScore: 9.60
Self-citation rate: 10.30%
Articles published: 297
About the journal: Complex & Intelligent Systems aims to provide a forum for presenting and discussing novel approaches, tools and techniques meant for attaining a cross-fertilization between the broad fields of complex systems, computational simulation, and intelligent analytics and visualization. The transdisciplinary research that the journal focuses on will expand the boundaries of our understanding by investigating the principles and processes that underlie many of the most profound problems facing society today.