{"title":"Learning a Bimodal Emotion Recognition System Based on Small Amount of Speech Data","authors":"Junya Furutani, Xin Kang, Keita Kiuchi, Ryota Nishimura, M. Sasayama, Kazuyuki Matsumoto","doi":"10.1109/ICSAI57119.2022.10005454","DOIUrl":null,"url":null,"abstract":"This paper presents a bimodal emotion recognition system based on the voice and text information using small amount of speech data. Specifically, speech is divided into voice and text to learn emotion classifiers for each modal. The probabilities obtained from these emotion classifiers are weighted based on Mehrabian’s rule and summed up for each emotion to calculate the final score for bimodal emotion recognition. To create a highly accurate system while solving the problem that there are few Japanese speech data with emotion labels, we propose a novel data augmentation method and employ a transfer learning approach based on a pre-trained VGG16 model and a fine-tuned Bidirectional Encoder Representations from Transformers (BERT) model on the tweets. In order to prove the effectiveness of the proposed method, we revealed the recognition results for 7 emotional states, i.e., anger, sadness, joy, fear, surprise, disgust, and neutral. The experiment result suggested that our novel data augmentation method improved the accuracy and that bimodal predictions based on voice and text-based outperformed the single-model predictions.","PeriodicalId":339547,"journal":{"name":"2022 8th International Conference on Systems and Informatics (ICSAI)","volume":"105 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 8th International Conference on Systems and Informatics (ICSAI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICSAI57119.2022.10005454","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
This paper presents a bimodal emotion recognition system that uses voice and text information learned from a small amount of speech data. Specifically, each speech sample is divided into its voice and text components, and an emotion classifier is trained for each modality. The probabilities produced by these classifiers are weighted according to Mehrabian's rule and summed for each emotion to compute the final bimodal recognition score. To build an accurate system despite the scarcity of Japanese speech data with emotion labels, we propose a novel data augmentation method and adopt a transfer learning approach based on a pre-trained VGG16 model and a Bidirectional Encoder Representations from Transformers (BERT) model fine-tuned on tweets. To demonstrate the effectiveness of the proposed method, we report recognition results for seven emotional states: anger, sadness, joy, fear, surprise, disgust, and neutral. The experimental results suggest that our data augmentation method improves accuracy and that the bimodal predictions based on voice and text outperform the single-modality predictions.
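To illustrate the fusion step described in the abstract, the minimal Python sketch below combines per-emotion probabilities from a voice classifier and a text classifier with a weighted sum. The specific weight values are an assumption on my part (derived by renormalizing the 38% vocal and 7% verbal shares of Mehrabian's rule for a two-modality system); the function name and toy inputs are hypothetical and the paper's actual weights and implementation may differ.

```python
import numpy as np

# Sketch of weighted late fusion for bimodal (voice + text) emotion recognition.
# ASSUMPTION: weights are taken from Mehrabian's 7%-38%-55% rule by keeping only
# the verbal (7%) and vocal (38%) shares and renormalizing them to sum to 1.
# The paper's exact weighting scheme is not specified here.

EMOTIONS = ["anger", "sadness", "joy", "fear", "surprise", "disgust", "neutral"]

W_VOICE = 38 / (38 + 7)  # assumed weight for the voice-based classifier (~0.84)
W_TEXT = 7 / (38 + 7)    # assumed weight for the text-based classifier (~0.16)


def fuse_predictions(p_voice: np.ndarray, p_text: np.ndarray) -> str:
    """Combine per-emotion probabilities from the two classifiers.

    p_voice, p_text: arrays of shape (7,) holding class probabilities
    for the seven emotions, in the order given by EMOTIONS.
    """
    score = W_VOICE * p_voice + W_TEXT * p_text  # weighted sum per emotion
    return EMOTIONS[int(np.argmax(score))]       # emotion with the highest fused score


if __name__ == "__main__":
    # Toy example: the voice model favors anger, the text model favors sadness.
    p_voice = np.array([0.40, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10])
    p_text = np.array([0.05, 0.60, 0.05, 0.10, 0.05, 0.05, 0.10])
    print(fuse_predictions(p_voice, p_text))  # -> "anger" under these assumed weights
```

With these assumed weights the voice modality dominates, which matches the intuition behind Mehrabian's rule that vocal cues carry more emotional information than words alone; a different weighting would shift the balance toward the text classifier.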