Multimodal aspect-based sentiment analysis (MABSA) aims to determine the sentiment polarity of each aspect mentioned in the text based on multimodal content. Various approaches have been proposed to model multimodal sentiment features for each aspect via modal interactions. However, most existing approaches have two shortcomings: (1) the representation gap between the textual and visual modalities may increase the risk of misalignment during modal interactions; (2) in examples where the image is unrelated to the text, the visual information may not enrich the textual modality when learning aspect-based sentiment features. In such cases, blindly leveraging visual information may introduce noise into the reasoning about aspect-based sentiment expressions. To tackle these shortcomings, we propose an end-to-end MABSA framework with image conversion and noise filtration. Specifically, to bridge the representation gap between the two modalities, we translate images into the input space of a pre-trained language model (PLM). To this end, we develop an image-to-text conversion module that converts an image into an implicit sequence of token embeddings. Moreover, an aspect-oriented filtration module, consisting of two attention operations, is devised to alleviate the noise in the implicit token embeddings. After filtering the noise, we leverage a PLM to encode the text, the aspect, and an image prompt derived from the filtered implicit token embeddings as sentiment features for aspect-based sentiment prediction. Experimental results on two MABSA datasets show that our framework achieves state-of-the-art performance. Furthermore, extensive experimental analysis demonstrates that the proposed framework has superior robustness and efficiency.
{"title":"Image-to-Text Conversion and Aspect-Oriented Filtration for Multimodal Aspect-Based Sentiment Analysis","authors":"Qianlong Wang;Hongling Xu;Zhiyuan Wen;Bin Liang;Min Yang;Bing Qin;Ruifeng Xu","doi":"10.1109/TAFFC.2023.3333200","DOIUrl":"10.1109/TAFFC.2023.3333200","url":null,"abstract":"Multimodal aspect-based sentiment analysis (MABSA) aims to determine the sentiment polarity of each aspect mentioned in the text based on multimodal content. Various approaches have been proposed to model multimodal sentiment features for each aspect via modal interactions. However, most existing approaches have two shortcomings: (1) The representation gap between textual and visual modalities may increase the risk of misalignment in modal interactions; (2) In some examples where the image is not related to the text, the visual information may not enrich the textual modality when learning aspect-based sentiment features. In such cases, blindly leveraging visual information may introduce noises in reasoning the aspect-based sentiment expressions. To tackle these shortcomings, we propose an end-to-end MABSA framework with image conversion and noise filtration. Specifically, to bridge the representation gap in different modalities, we attempt to translate images into the input space of a pre-trained language model (PLM). To this end, we develop an image-to-text conversion module that can convert an image to an implicit sequence of token embedding. Moreover, an aspect-oriented filtration module is devised to alleviate the noise in the implicit token embeddings, which consists of two attention operations. After filtering the noise, we leverage a PLM to encode the text, aspect, and image prompt derived from filtered implicit token embeddings as sentiment features to perform aspect-based sentiment prediction. Experimental results on two MABSA datasets show that our framework achieves state-of-the-art performance. Furthermore, extensive experimental analysis demonstrates the proposed framework has superior robustness and efficiency.","PeriodicalId":13131,"journal":{"name":"IEEE Transactions on Affective Computing","volume":"15 3","pages":"1264-1278"},"PeriodicalIF":9.6,"publicationDate":"2023-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135709632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-11-14 | DOI: 10.1109/TAFFC.2023.3332631
Shuhe Zhang;Haifeng Hu;Songlong Xing
Emotion recognition in conversation (ERC) and emotional response generation (ERG) are two important NLP tasks. ERC aims to detect the utterance-level emotion in a dialogue, while ERG focuses on expressing a desired emotion. Essentially, ERC is a classification task, with its input and output domains being the utterance text and emotion labels, respectively. ERG, on the other hand, is a generation task whose input and output domains are the reverse. These two tasks are highly related, yet, surprisingly, prior works address them independently without making use of their duality. Therefore, in this article, we propose to solve the two tasks in a dual learning framework. Our contributions are fourfold: (1) We propose a dual learning framework for ERC and ERG. (2) Within the proposed framework, the two models can be trained jointly, so that the duality between them can be utilised. (3) Instead of a symmetric framework that deals with two tasks over the same data domain, we propose a dual learning framework that operates on a pair of asymmetric input and output spaces, i.e., the natural language space and the emotion labels. (4) Experiments are conducted on benchmark datasets to demonstrate the effectiveness of our framework.
{"title":"Dual Learning for Conversational Emotion Recognition and Emotional Response Generation","authors":"Shuhe Zhang;Haifeng Hu;Songlong Xing","doi":"10.1109/TAFFC.2023.3332631","DOIUrl":"10.1109/TAFFC.2023.3332631","url":null,"abstract":"Emotion recognition in conversation (ERC) and emotional response generation (ERG) are two important NLP tasks. ERC aims to detect the utterance-level emotion from a dialogue, while ERG focuses on expressing a desired emotion. Essentially, ERC is a classification task, with its input and output domains being the utterance text and emotion labels, respectively. On the other hand, ERG is a generation task with its input and output domains being the opposite. These two tasks are highly related, but surprisingly, they are addressed independently without making use of their duality in prior works. Therefore, in this article, we propose to solve these two tasks in a dual learning framework. Our contributions are fourfold: (1) We propose a dual learning framework for ERC and ERG. (2) Within the proposed framework, two models can be trained jointly, so that the duality between them can be utilised. (3) Instead of a symmetric framework that deals with two tasks of the same data domain, we propose a dual learning framework that performs on a pair of asymmetric input and output spaces, i.e., the natural language space and the emotion labels. (4) Experiments are conducted on benchmark datasets to demonstrate the effectiveness of our framework.","PeriodicalId":13131,"journal":{"name":"IEEE Transactions on Affective Computing","volume":"15 3","pages":"1241-1252"},"PeriodicalIF":9.6,"publicationDate":"2023-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135703713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-11-14 | DOI: 10.1109/TAFFC.2023.3332742
Fotis Efthymiou;Christian Hildebrand
Recent advances in artificial speech synthesis and machine learning equip AI-powered conversational agents, from voice assistants to social robots, with the ability to mimic human emotional expression during their interactions with users. One unexplored development is the ability to design machine-generated voices that induce varying levels of “shakiness” (i.e., trembling) in the agents’ voices. In the current work, we examine how the trembling voice of a conversational AI impacts users’ perceptions, affective experiences, and their subsequent behavior. Across three studies, we demonstrate that a trembling voice enhances the perceived psychological vulnerability of the agent, followed by a heightened sense of empathic concern, ultimately increasing people's willingness to donate in a prosocial charity context. We provide further evidence from a large-scale field experiment that conversational agents with a trembling voice lead to increased click-through rates and decreased costs-per-impression in an online charity advertising setting. These findings deepen our understanding of the nuanced impact of intentionally designed voices of conversational AI agents on humans and highlight the ethical and societal challenges that arise.
{"title":"Empathy by Design: The Influence of Trembling AI Voices on Prosocial Behavior","authors":"Fotis Efthymiou;Christian Hildebrand","doi":"10.1109/TAFFC.2023.3332742","DOIUrl":"10.1109/TAFFC.2023.3332742","url":null,"abstract":"Recent advances in artificial speech synthesis and machine learning equip AI-powered conversational agents, from voice assistants to social robots, with the ability to mimic human emotional expression during their interactions with users. One unexplored development is the ability to design machine-generated voices that induce varying levels of “shakiness” (i.e., trembling) in the agents’ voices. In the current work, we examine how the trembling voice of a conversational AI impacts users’ perceptions, affective experiences, and their subsequent behavior. Across three studies, we demonstrate that a trembling voice enhances the perceived psychological vulnerability of the agent, followed by a heightened sense of empathic concern, ultimately increasing people's willingness to donate in a prosocial charity context. We provide further evidence from a large-scale field experiment that conversational agents with a trembling voice lead to increased click-through rates and decreased costs-per-impression in an online charity advertising setting. These findings deepen our understanding of the nuanced impact of intentionally designed voices of conversational AI agents on humans and highlight the ethical and societal challenges that arise.","PeriodicalId":13131,"journal":{"name":"IEEE Transactions on Affective Computing","volume":"15 3","pages":"1253-1263"},"PeriodicalIF":9.6,"publicationDate":"2023-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10316625","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135704990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-11-02 | DOI: 10.1109/TAFFC.2023.3329563
Soraia M. Alarcão;Vânia Mendonça;Cláudia Sevivas;Carolina Maruta;Manuel J. Fonseca
The success of supervised models for emotion recognition on images heavily depends on the availability of properly annotated images. Although millions of images are presently available, only a few are annotated with reliable emotional information. Current emotion recognition solutions either use large amounts of weakly-labeled web images, which often contain noise unrelated to the emotions of the image, or transfer learning, which usually results in performance losses. Thus, it would be desirable to know which images are worth annotating, so as to avoid an extensive annotation effort. In this paper, we propose a novel approach based on active learning to choose which images are most relevant to annotate. Our approach dynamically combines multiple active learning strategies and learns which ones are best, without prior knowledge of them. Experiments using nine benchmark datasets revealed that: (i) active learning reduces the annotation effort while reaching or surpassing the performance of a supervised baseline with as little as 3% to 18% of the baseline's training set in classification tasks; (ii) our online combination of multiple strategies converges to the performance of the best individual strategies, while avoiding the experimentation overhead needed to identify them.
{"title":"Annotate Smarter, not Harder: Using Active Learning to Reduce Emotional Annotation Effort","authors":"Soraia M. Alarcão;Vânia Mendonça;Cláudia Sevivas;Carolina Maruta;Manuel J. Fonseca","doi":"10.1109/TAFFC.2023.3329563","DOIUrl":"10.1109/TAFFC.2023.3329563","url":null,"abstract":"The success of supervised models for emotion recognition on images heavily depends on the availability of images properly annotated. Although millions of images are presently available, only a few are annotated with reliable emotional information. Current emotion recognition solutions either use large amounts of weakly-labeled web images, which often contain noise that is unrelated to the emotions of the image, or transfer learning, which usually results in performance losses. Thus, it would be desirable to know which images would be useful to be annotated to avoid an extensive annotation effort. In this paper, we propose a novel approach based on active learning to choose which images are more relevant to be annotated. Our approach dynamically combines multiple active learning strategies and learns the best ones (without prior knowledge of the best ones). Experiments using nine benchmark datasets revealed that: (i) active learning allows to reduce the annotation effort, while reaching or surpassing the performance of a supervised baseline with as little as 3% to 18% of the baseline's training set, in classification tasks; (ii) our online combination of multiple strategies converges to the performance of the best individual strategies, while avoiding the experimentation overhead needed to identify them.","PeriodicalId":13131,"journal":{"name":"IEEE Transactions on Affective Computing","volume":"15 3","pages":"1213-1227"},"PeriodicalIF":9.6,"publicationDate":"2023-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134890813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-11-02 | DOI: 10.1109/TAFFC.2023.3329526
Xinyuan Wang;Danli Wang;Xuange Gao;Yanyan Zhao;Steve C. Chiu
Emotions are important factors in decision-making. With the advent of brain-computer interface (BCI) techniques, researchers have developed a strong interest in predicting decisions from emotions, which is a challenging task. To predict decision-making performance using emotion, we propose the Maximizing Mutual Information between Emotion and Decision relevant features (MMI-ED) method, with three modules: (1) a temporal-spatial encoding module captures spatial correlation and temporal dependence from electroencephalogram (EEG) signals; (2) a relevant feature decomposition module extracts emotion-relevant and decision-relevant features; (3) a relevant feature fusion module maximizes the mutual information between them to incorporate useful emotion-related information during decision prediction. To construct a dataset that uses emotions to predict decision-making performance, we designed an experiment involving emotion elicitation and decision-making tasks and collected EEG, behavioral, and subjective data. We compared our model with several emotion recognition and motor imagery models on our dataset. The results demonstrate that our model achieves state-of-the-art performance, with a classification accuracy of 92.96%.
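As a rough illustration of the mutual-information idea, the sketch below derives an emotion branch and a decision branch from a shared temporal-spatial EEG encoding and maximises an InfoNCE lower bound on the MI between them. The encoder layout, the use of InfoNCE as the MI estimator, and all shapes are assumptions, not the MMI-ED implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalSpatialEncoder(nn.Module):
    """Temporal 1-D convolution over the EEG channels followed by pooling and a
    spatial projection; a generic stand-in for the paper's encoding module."""
    def __init__(self, n_channels=32, dim=128):
        super().__init__()
        self.temporal = nn.Conv1d(n_channels, 64, kernel_size=7, padding=3)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.spatial = nn.Linear(64, dim)

    def forward(self, eeg):                       # eeg: (B, n_channels, n_times)
        h = F.relu(self.temporal(eeg))            # (B, 64, n_times)
        h = self.pool(h).squeeze(-1)              # (B, 64)
        return self.spatial(h)                    # (B, dim)

def info_nce(z_emotion, z_decision, temperature=0.1):
    """Contrastive lower bound on mutual information: features from the same
    trial are positives, all other pairings in the batch are negatives."""
    z1 = F.normalize(z_emotion, dim=-1)
    z2 = F.normalize(z_decision, dim=-1)
    logits = z1 @ z2.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))            # i-th emotion matches i-th decision
    return F.cross_entropy(logits, targets)

encoder = TemporalSpatialEncoder()
emotion_head = nn.Linear(128, 128)                # emotion-relevant branch
decision_head = nn.Linear(128, 128)               # decision-relevant branch

eeg = torch.randn(8, 32, 256)                     # a batch of 8 EEG trials
shared = encoder(eeg)
loss_mi = info_nce(emotion_head(shared), decision_head(shared))
print(loss_mi.item())                             # minimising this maximises the MI bound
```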