Attention-optimized vision-enhanced prompt learning for few-shot multi-modal sentiment analysis

Zikai Zhou, Baiyou Qiao, Haisong Feng, Donghong Han, Gang Wu
{"title":"Attention-optimized vision-enhanced prompt learning for few-shot multi-modal sentiment analysis","authors":"Zikai Zhou, Baiyou Qiao, Haisong Feng, Donghong Han, Gang Wu","doi":"10.1007/s00521-024-10297-w","DOIUrl":null,"url":null,"abstract":"<p>To fulfill the explosion of multi-modal data, multi-modal sentiment analysis (MSA) emerged and attracted widespread attention. Unfortunately, conventional multi-modal research relies on large-scale datasets. On the one hand, collecting and annotating large-scale datasets is challenging and resource-intensive. On the other hand, the training on large-scale datasets also increases the research cost. However, the few-shot MSA (FMSA), which is proposed recently, requires only few samples for training. Therefore, in comparison, it is more practical and realistic. There have been approaches to investigating the prompt-based method in the field of FMSA, but they have not sufficiently considered or leveraged the information specificity of visual modality. Thus, we propose a vision-enhanced prompt-based model based on graph structure to better utilize vision information for fusion and collaboration in encoding and optimizing prompt representations. Specifically, we first design an aggregation-based multi-modal attention module. Then, based on this module and the biaffine attention, we construct a syntax–semantic dual-channel graph convolutional network to optimize the encoding of learnable prompts by understanding the vision-enhanced information in semantic and syntactic knowledge. Finally, we propose a collaboration-based optimization module based on the collaborative attention mechanism, which employs visual information to collaboratively optimize prompt representations. Extensive experiments conducted on both coarse-grained and fine-grained MSA datasets have demonstrated that our model significantly outperforms the baseline models.</p>","PeriodicalId":18925,"journal":{"name":"Neural Computing and Applications","volume":"37 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neural Computing and Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s00521-024-10297-w","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

To cope with the explosion of multi-modal data, multi-modal sentiment analysis (MSA) has emerged and attracted widespread attention. Unfortunately, conventional multi-modal research relies on large-scale datasets. On the one hand, collecting and annotating large-scale datasets is challenging and resource-intensive; on the other hand, training on large-scale datasets also increases research costs. In contrast, the recently proposed few-shot MSA (FMSA) requires only a few samples for training, which makes it more practical and realistic. Prompt-based methods have been investigated for FMSA, but they have not sufficiently considered or leveraged the information specificity of the visual modality. We therefore propose a vision-enhanced, prompt-based model built on a graph structure that better exploits visual information for fusion and collaboration when encoding and optimizing prompt representations. Specifically, we first design an aggregation-based multi-modal attention module. Then, building on this module and biaffine attention, we construct a syntax–semantic dual-channel graph convolutional network that optimizes the encoding of learnable prompts by interpreting the vision-enhanced information in both semantic and syntactic knowledge. Finally, we propose a collaboration-based optimization module built on a collaborative attention mechanism, which employs visual information to collaboratively optimize prompt representations. Extensive experiments on both coarse-grained and fine-grained MSA datasets demonstrate that our model significantly outperforms the baseline models.
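The abstract names two generic building blocks, biaffine attention and graph convolution over semantic/syntactic graphs, for its dual-channel prompt encoder. Below is a minimal, hypothetical sketch of how a biaffine-scored adjacency can feed a graph-convolution channel. It is not the authors' implementation; all module names, tensor shapes, and hyper-parameters are illustrative assumptions.

```python
# Hypothetical sketch (not the authors' released code): a biaffine attention
# scorer producing a soft adjacency, and one graph-convolution step that
# consumes it. Shapes and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiaffineAttention(nn.Module):
    """Scores every pair of token representations with a biaffine form,
    yielding a soft adjacency matrix for a semantic GCN channel."""

    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Parameter(torch.empty(dim, dim))
        nn.init.xavier_uniform_(self.W)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, dim) token (or prompt-token) representations
        scores = h @ self.W @ h.transpose(1, 2)   # (batch, seq, seq) pairwise scores
        return F.softmax(scores, dim=-1)          # row-normalised soft adjacency


class GCNLayer(nn.Module):
    """One graph-convolution step: aggregate neighbours under an adjacency."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # adj: (batch, seq, seq) — semantic (biaffine) or syntactic (parse) graph
        return F.relu(self.proj(adj @ h))


if __name__ == "__main__":
    batch, seq_len, dim = 2, 8, 64
    h = torch.randn(batch, seq_len, dim)                       # fused text+vision features (assumed)
    semantic_adj = BiaffineAttention(dim)(h)                   # semantic-channel graph
    syntactic_adj = torch.eye(seq_len).expand(batch, -1, -1)   # stand-in for a dependency-parse graph
    h_sem = GCNLayer(dim)(h, semantic_adj)
    h_syn = GCNLayer(dim)(h, syntactic_adj)
    print(h_sem.shape, h_syn.shape)                            # torch.Size([2, 8, 64]) each
```

In the paper's framing, the two channels would be driven by different graphs (biaffine-derived semantic edges versus syntactic dependencies); the sketch only shows the shared mechanics of scoring and neighbourhood aggregation.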
