Connecting Cross-Modal Representations for Compact and Robust Multimodal Sentiment Analysis With Sentiment Word Substitution Error

IF 9.8 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · IEEE Transactions on Affective Computing · Pub Date: 2024-11-04 · DOI: 10.1109/TAFFC.2024.3490694
Qiyuan Sun;Haolin Zuo;Rui Liu;Haizhou Li
{"title":"Connecting Cross-Modal Representations for Compact and Robust Multimodal Sentiment Analysis With Sentiment Word Substitution Error","authors":"Qiyuan Sun;Haolin Zuo;Rui Liu;Haizhou Li","doi":"10.1109/TAFFC.2024.3490694","DOIUrl":null,"url":null,"abstract":"Multimodal Sentiment Analysis (MSA) seeks to fuse textual, acoustic, and visual information to predict a speaker’s sentiment states effectively. However, in real-world scenarios, the text modality received by MSA systems is often obtained through automatic speech recognition (ASR) models. Unfortunately, ASR may erroneously recognize sentiment words as phonetically similar neutral alternatives, leading to sentiment degradation in text and impacting MSA accuracy. Recent attempts aim to first identify the sentiment word substitution (SWS) error in ASR results and then refine the corrupted word embeddings using multimodal information for final multimodal fusion. However, such a method includes a burdensome and ambiguous detection operation and ignores the inherent correlations and heterogeneity among different modalities. To address these issues, we propose a more compact system, termed <bold>ARF-MSA</b> consisting of three key components to achieving robust MSA with SWS errors: 1) <bold>Alignment</b>: we establish connections between the “text-acoustic’ and “text-visual” representations to effectively map the “text-acoustic-visual” data into a unified sentiment space by leveraging their multimodal correlation knowledge; 2) <bold>Refinement</b>: we perform fine-grained comparisons between the text modality and the other two modalities in the unified sentiment space, enabling refinement of the sentiment expression within the text modality more concisely; 3) <bold>Fusion</b>: Finally, we hierarchically fuse the dominant and non-dominant representation from three heterogeneity modalities to obtain the multimodal feature for MSA. We conduct extensive experiments on the real-world datasets and the results demonstrate the effectiveness of our model.","PeriodicalId":13131,"journal":{"name":"IEEE Transactions on Affective Computing","volume":"16 3","pages":"1265-1276"},"PeriodicalIF":9.8000,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Affective Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10741889/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Multimodal Sentiment Analysis (MSA) seeks to fuse textual, acoustic, and visual information to predict a speaker's sentiment state effectively. However, in real-world scenarios, the text modality received by MSA systems is often obtained through automatic speech recognition (ASR) models. Unfortunately, ASR may erroneously recognize sentiment words as phonetically similar neutral alternatives, degrading the sentiment carried by the text and impairing MSA accuracy. Recent attempts first identify the sentiment word substitution (SWS) errors in ASR results and then refine the corrupted word embeddings using multimodal information before the final multimodal fusion. However, such a method requires a burdensome and ambiguous detection step and ignores the inherent correlations and heterogeneity among the modalities. To address these issues, we propose a more compact system, termed ARF-MSA, consisting of three key components to achieve robust MSA under SWS errors: 1) Alignment: we establish connections between the "text-acoustic" and "text-visual" representations to map the "text-acoustic-visual" data into a unified sentiment space by leveraging their multimodal correlation knowledge; 2) Refinement: we perform fine-grained comparisons between the text modality and the other two modalities in the unified sentiment space, enabling more concise refinement of the sentiment expression within the text modality; 3) Fusion: finally, we hierarchically fuse the dominant and non-dominant representations from the three heterogeneous modalities to obtain the multimodal feature for MSA. We conduct extensive experiments on real-world datasets, and the results demonstrate the effectiveness of our model.
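To make the Alignment-Refinement-Fusion pipeline concrete, below is a minimal PyTorch sketch of one way the three stages could be wired together. The class name ARFMSASketch, all layer choices (linear projections into a shared space, cross-attention for the "text-acoustic" and "text-visual" connections, a gated residual update for refinement, two-step hierarchical fusion), and the feature dimensions are illustrative assumptions for exposition only; the abstract does not specify the paper's actual architecture.

```python
# A hedged sketch of the Alignment -> Refinement -> Fusion pipeline.
# All module names, dimensions, and layer choices are assumptions,
# not the paper's published implementation.
import torch
import torch.nn as nn


class ARFMSASketch(nn.Module):
    def __init__(self, d_text=768, d_audio=74, d_visual=35, d_model=128):
        super().__init__()
        # 1) Alignment: project each modality into a shared sentiment space.
        self.proj_t = nn.Linear(d_text, d_model)
        self.proj_a = nn.Linear(d_audio, d_model)
        self.proj_v = nn.Linear(d_visual, d_model)
        # Cross-attention connecting "text-acoustic" and "text-visual".
        self.attn_ta = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.attn_tv = nn.MultiheadAttention(d_model, 4, batch_first=True)
        # 2) Refinement: a gate decides how strongly each text token is
        #    corrected by acoustic/visual evidence (e.g., when an ASR
        #    substitution has neutralized a sentiment word).
        self.gate = nn.Sequential(nn.Linear(3 * d_model, d_model), nn.Sigmoid())
        self.refine = nn.Linear(3 * d_model, d_model)
        # 3) Fusion: hierarchical -- fuse the non-dominant (audio/visual)
        #    pair first, then combine with the dominant (refined text).
        self.fuse_av = nn.Linear(2 * d_model, d_model)
        self.fuse_all = nn.Linear(2 * d_model, d_model)
        self.head = nn.Linear(d_model, 1)  # a single sentiment score

    def forward(self, text, audio, visual):
        # text/audio/visual: (batch, seq_len, d_modality)
        t, a, v = self.proj_t(text), self.proj_a(audio), self.proj_v(visual)
        # Alignment: text queries attend over the acoustic/visual streams.
        ta, _ = self.attn_ta(t, a, a)
        tv, _ = self.attn_tv(t, v, v)
        # Refinement: fine-grained, token-level comparison in shared space.
        ctx = torch.cat([t, ta, tv], dim=-1)
        t_ref = t + self.gate(ctx) * self.refine(ctx)
        # Fusion: pool sequences, then fuse hierarchically.
        t_pool, a_pool, v_pool = t_ref.mean(1), a.mean(1), v.mean(1)
        av = self.fuse_av(torch.cat([a_pool, v_pool], dim=-1))
        fused = self.fuse_all(torch.cat([t_pool, av], dim=-1))
        return self.head(fused).squeeze(-1)


# Usage with random tensors shaped like typical MSA features:
model = ARFMSASketch()
score = model(torch.randn(2, 50, 768), torch.randn(2, 50, 74), torch.randn(2, 50, 35))
print(score.shape)  # torch.Size([2])
```

The gated residual update is the key idea of the refinement stage as described: text tokens whose sentiment agrees with the acoustic/visual evidence pass through nearly unchanged, while tokens that conflict with it (a likely SWS error) receive a larger correction from the cross-modal context.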
Source journal
IEEE Transactions on Affective Computing
Categories: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; COMPUTER SCIENCE, CYBERNETICS
CiteScore: 15.00
Self-citation rate: 6.20%
Annual publications: 174
About the journal: The IEEE Transactions on Affective Computing is an international and interdisciplinary journal. Its primary goal is to share research findings on the development of systems capable of recognizing, interpreting, and simulating human emotions and related affective phenomena. The journal publishes original research on the underlying principles and theories that explain how and why affective factors shape human-technology interactions. It also focuses on how techniques for sensing and simulating affect can enhance our understanding of human emotions and processes. Additionally, the journal explores the design, implementation, and evaluation of systems that prioritize the consideration of affect in their usability. We also welcome surveys of existing work that provide new perspectives on the historical and future directions of this field.
Latest articles in this journal
UCSM-TG: Utterance, Conversation and Speaker-level Speech Emotion Tracking Model in Conversations Using Transformer-GRU
Strength in Numbers, Power in Subjectivity: Scalable Modeling of Individual Annotators for Emotion Recognition Within and Across Corpora
LPM-Aug: Latent Pathology-Informed Multimodal Augmentation for Generalized Cognitive Decline Detection Via Speech
MA-DLE: Speech-based Automatic Depression Level Estimation via Memory Augmentation
Reversible Graph Neural Network-based Reaction Distribution Learning for Multiple Appropriate Facial Reactions Generation