{"title":"Connecting Cross-Modal Representations for Compact and Robust Multimodal Sentiment Analysis With Sentiment Word Substitution Error","authors":"Qiyuan Sun;Haolin Zuo;Rui Liu;Haizhou Li","doi":"10.1109/TAFFC.2024.3490694","DOIUrl":null,"url":null,"abstract":"Multimodal Sentiment Analysis (MSA) seeks to fuse textual, acoustic, and visual information to predict a speaker’s sentiment states effectively. However, in real-world scenarios, the text modality received by MSA systems is often obtained through automatic speech recognition (ASR) models. Unfortunately, ASR may erroneously recognize sentiment words as phonetically similar neutral alternatives, leading to sentiment degradation in text and impacting MSA accuracy. Recent attempts aim to first identify the sentiment word substitution (SWS) error in ASR results and then refine the corrupted word embeddings using multimodal information for final multimodal fusion. However, such a method includes a burdensome and ambiguous detection operation and ignores the inherent correlations and heterogeneity among different modalities. To address these issues, we propose a more compact system, termed <bold>ARF-MSA</b> consisting of three key components to achieving robust MSA with SWS errors: 1) <bold>Alignment</b>: we establish connections between the “text-acoustic’ and “text-visual” representations to effectively map the “text-acoustic-visual” data into a unified sentiment space by leveraging their multimodal correlation knowledge; 2) <bold>Refinement</b>: we perform fine-grained comparisons between the text modality and the other two modalities in the unified sentiment space, enabling refinement of the sentiment expression within the text modality more concisely; 3) <bold>Fusion</b>: Finally, we hierarchically fuse the dominant and non-dominant representation from three heterogeneity modalities to obtain the multimodal feature for MSA. We conduct extensive experiments on the real-world datasets and the results demonstrate the effectiveness of our model.","PeriodicalId":13131,"journal":{"name":"IEEE Transactions on Affective Computing","volume":"16 3","pages":"1265-1276"},"PeriodicalIF":9.8000,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Affective Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10741889/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Multimodal Sentiment Analysis (MSA) seeks to fuse textual, acoustic, and visual information to predict a speaker’s sentiment states effectively. However, in real-world scenarios, the text modality received by MSA systems is often obtained through automatic speech recognition (ASR) models. Unfortunately, ASR may erroneously recognize sentiment words as phonetically similar neutral alternatives, leading to sentiment degradation in the text and impacting MSA accuracy. Recent attempts first identify the sentiment word substitution (SWS) error in ASR results and then refine the corrupted word embeddings using multimodal information before the final multimodal fusion. However, such methods involve a burdensome and ambiguous detection step and ignore the inherent correlations and heterogeneity among the different modalities. To address these issues, we propose a more compact system, termed ARF-MSA, consisting of three key components for achieving robust MSA under SWS errors: 1) Alignment: we establish connections between the “text-acoustic” and “text-visual” representations to effectively map the “text-acoustic-visual” data into a unified sentiment space by leveraging their multimodal correlation knowledge; 2) Refinement: we perform fine-grained comparisons between the text modality and the other two modalities in the unified sentiment space, enabling the sentiment expression within the text modality to be refined more concisely; 3) Fusion: finally, we hierarchically fuse the dominant and non-dominant representations from the three heterogeneous modalities to obtain the multimodal feature for MSA. We conduct extensive experiments on real-world datasets, and the results demonstrate the effectiveness of our model.
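To make the alignment-refinement-fusion pipeline described above more concrete, the following is a minimal, illustrative PyTorch sketch of such a three-stage flow. All module choices, feature dimensions, and the name ARFMSASketch are assumptions introduced here for illustration only; they are not the authors' implementation of ARF-MSA.

```python
# Illustrative sketch of an alignment -> refinement -> fusion pipeline for MSA.
# Assumptions: BERT-like text features (768-d), generic acoustic (74-d) and
# visual (35-d) frame features, and simple cross-attention refinement.
import torch
import torch.nn as nn


class ARFMSASketch(nn.Module):
    def __init__(self, d_text=768, d_audio=74, d_visual=35, d_model=128):
        super().__init__()
        # Alignment: project each modality into a shared sentiment space.
        self.text_proj = nn.Linear(d_text, d_model)
        self.audio_proj = nn.Linear(d_audio, d_model)
        self.visual_proj = nn.Linear(d_visual, d_model)
        # Refinement: the (possibly ASR-corrupted) text queries the acoustic
        # and visual streams for complementary sentiment cues.
        self.text_audio_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.text_visual_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        # Fusion: hierarchically combine the refined (dominant) text
        # representation with the non-dominant acoustic/visual ones.
        self.av_fuse = nn.Linear(2 * d_model, d_model)
        self.final_fuse = nn.Linear(2 * d_model, d_model)
        self.classifier = nn.Linear(d_model, 1)  # scalar sentiment score

    def forward(self, text, audio, visual):
        # Inputs: (batch, seq_len, feature_dim) sequences per modality.
        t = self.text_proj(text)
        a = self.audio_proj(audio)
        v = self.visual_proj(visual)
        # Refine text with acoustic and visual context via cross-attention.
        t_a, _ = self.text_audio_attn(t, a, a)
        t_v, _ = self.text_visual_attn(t, v, v)
        t_refined = t + t_a + t_v
        # Pool over time, fuse non-dominant modalities first, then with text.
        av = torch.relu(self.av_fuse(torch.cat([a.mean(dim=1), v.mean(dim=1)], dim=-1)))
        fused = torch.relu(self.final_fuse(torch.cat([t_refined.mean(dim=1), av], dim=-1)))
        return self.classifier(fused)


if __name__ == "__main__":
    model = ARFMSASketch()
    text = torch.randn(2, 50, 768)
    audio = torch.randn(2, 50, 74)
    visual = torch.randn(2, 50, 35)
    print(model(text, audio, visual).shape)  # torch.Size([2, 1])
```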
Journal Introduction:
The IEEE Transactions on Affective Computing is an international and interdisciplinary journal. Its primary goal is to share research findings on the development of systems capable of recognizing, interpreting, and simulating human emotions and related affective phenomena. The journal publishes original research on the underlying principles and theories that explain how and why affective factors shape human-technology interactions. It also focuses on how techniques for sensing and simulating affect can enhance our understanding of human emotions and processes. Additionally, the journal explores the design, implementation, and evaluation of systems that prioritize the consideration of affect in their usability. We also welcome surveys of existing work that provide new perspectives on the historical and future directions of this field.