Multimodal Emotion Recognition Using Contextualized Audio Information and Ground Transcripts on Multiple Datasets

Arabian Journal for Science and Engineering, 49(9): 11871–11881 | Impact Factor 2.9 | CAS Zone 4 (Multidisciplinary) | JCR Q2, Multidisciplinary Sciences | Published: 2023-11-07 | DOI: 10.1007/s13369-023-08395-3
Krishna Chauhan, Kamalesh Kumar Sharma, Tarun Varma
{"title":"Multimodal Emotion Recognition Using Contextualized Audio Information and Ground Transcripts on Multiple Datasets","authors":"Krishna Chauhan,&nbsp;Kamalesh Kumar Sharma,&nbsp;Tarun Varma","doi":"10.1007/s13369-023-08395-3","DOIUrl":null,"url":null,"abstract":"<div><p>The widespread applications of emotion recognition (ER) in various fields have recently attracted much attention from researchers. Consequently, an array of advanced techniques has emerged, driven by enhancing the accuracy and robustness of these recognition systems. As emotional dialogue comprises sound and spoken content, the proposed model encodes the information from audio and text sequences using two separate channels and merges them for emotional classification. The two channels used inputs from audio and text modalities. The audio channel is encoded using a deep convolutional neural network with residual connections and further transformed using a self-attention-based multihead attention network called channel-wise global head pooling. Unlike the vanilla multihead attention network, an adaptive global pooling is used after concatenating all the heads. The text channel is encoded using a pre-trained BERT model. The proposed ER method is validated on four benchmark databases: Interactive Emotional Dyadic Motion Capture in English, the Berlin emotional speech dataset in the German language, Ryerson Audio-Visual Database of Emotional Speech and Song in North American English and Crowd-sourced Emotional Multimodal Actors Dataset in English. The classification accuracy on the above emotional corpora is 85.71%, 79.52%, 76.71% and 73.91%, respectively. Furthermore, cross-corpus analysis is presented to understand the variability of speech and text features and the robustness of the proposed architecture.</p></div>","PeriodicalId":54354,"journal":{"name":"Arabian Journal for Science and Engineering","volume":"49 9","pages":"11871 - 11881"},"PeriodicalIF":2.9000,"publicationDate":"2023-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Arabian Journal for Science and Engineering","FirstCategoryId":"103","ListUrlMain":"https://link.springer.com/article/10.1007/s13369-023-08395-3","RegionNum":4,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
Citations: 0

Abstract

The widespread applications of emotion recognition (ER) in various fields have recently attracted much attention from researchers. Consequently, an array of advanced techniques has emerged, driven by the goal of enhancing the accuracy and robustness of these recognition systems. As emotional dialogue comprises both sound and spoken content, the proposed model encodes the information from audio and text sequences using two separate channels and merges them for emotion classification. The two channels take inputs from the audio and text modalities. The audio channel is encoded using a deep convolutional neural network with residual connections and further transformed using a self-attention-based multihead attention network called channel-wise global head pooling. Unlike the vanilla multihead attention network, an adaptive global pooling is applied after concatenating all the heads. The text channel is encoded using a pre-trained BERT model. The proposed ER method is validated on four benchmark databases: the Interactive Emotional Dyadic Motion Capture database in English, the Berlin emotional speech dataset in German, the Ryerson Audio-Visual Database of Emotional Speech and Song in North American English, and the Crowd-sourced Emotional Multimodal Actors Dataset in English. The classification accuracies on these emotional corpora are 85.71%, 79.52%, 76.71% and 73.91%, respectively. Furthermore, a cross-corpus analysis is presented to examine the variability of speech and text features and the robustness of the proposed architecture.
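
The abstract gives no implementation details, but the dual-channel design it describes (a residual CNN plus multihead self-attention with global pooling for the audio channel, a BERT encoder for the text channel, and fusion for classification) can be sketched roughly as below. This is a minimal PyTorch sketch under stated assumptions: the layer sizes, number of attention heads, pooling choice, and number of emotion classes are illustrative only, and the text branch consumes precomputed 768-dimensional BERT sentence embeddings rather than running BERT end to end. None of these choices are taken from the paper itself.

```python
# Illustrative sketch only; hyperparameters and the exact "channel-wise global
# head pooling" design are assumptions, not the authors' published configuration.
import torch
import torch.nn as nn


class ResidualConvBlock(nn.Module):
    """1-D convolutional block with a residual (skip) connection for the audio channel."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.conv(x))  # residual connection


class AudioChannel(nn.Module):
    """Residual CNN followed by multihead self-attention, then a global pooling
    applied after the attention heads are concatenated (an approximation of the
    paper's channel-wise global head pooling)."""

    def __init__(self, n_mels: int = 64, channels: int = 128, heads: int = 4):
        super().__init__()
        self.stem = nn.Conv1d(n_mels, channels, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(*[ResidualConvBlock(channels) for _ in range(3)])
        self.attn = nn.MultiheadAttention(channels, num_heads=heads, batch_first=True)
        self.pool = nn.AdaptiveAvgPool1d(1)  # adaptive global pooling over time

    def forward(self, mel):                      # mel: (batch, n_mels, time)
        h = self.blocks(self.stem(mel))           # (batch, channels, time)
        h = h.transpose(1, 2)                     # (batch, time, channels)
        h, _ = self.attn(h, h, h)                 # self-attention; heads concatenated internally
        return self.pool(h.transpose(1, 2)).squeeze(-1)  # (batch, channels)


class MultimodalER(nn.Module):
    """Fusion of the audio embedding and a (precomputed) BERT sentence embedding."""

    def __init__(self, bert_dim: int = 768, audio_dim: int = 128, n_emotions: int = 4):
        super().__init__()
        self.audio = AudioChannel(channels=audio_dim)
        self.text_proj = nn.Linear(bert_dim, audio_dim)   # stand-in for the BERT text channel
        self.classifier = nn.Linear(2 * audio_dim, n_emotions)

    def forward(self, mel, bert_cls):
        fused = torch.cat([self.audio(mel), self.text_proj(bert_cls)], dim=-1)
        return self.classifier(fused)              # emotion logits


# Example forward pass with random tensors standing in for real features.
model = MultimodalER()
logits = model(torch.randn(2, 64, 300), torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 4])
```

The random tensors at the end stand in for a batch of two log-mel spectrograms (64 mel bands by 300 frames) and their corresponding BERT sentence embeddings; in practice these would come from an acoustic front end and a pre-trained BERT encoder.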

Source Journal
Arabian Journal for Science and Engineering
CiteScore: 5.70
Self-citation rate: 3.40%
Articles published: 993
Journal description: King Fahd University of Petroleum & Minerals (KFUPM) partnered with Springer to publish the Arabian Journal for Science and Engineering (AJSE). AJSE, which has been published by KFUPM since 1975, is a recognized national, regional and international journal that provides a great opportunity for the dissemination of research advances from the Kingdom of Saudi Arabia, MENA and the world.
Latest articles in this journal:
Generative Adversarial Networks for Intrusion Detection Systems: A Comprehensive Survey of Applications, Challenges, and Research Directions
Breathing Cycle Detection for Respiratory Tele-health Systems
Enhanced Voltammetric Detection of Selected Antibiotic Residues in Dairy Products Utilizing Iron-doped Zeolite-Modified Carbon Paste Electrode
Graphene-Based Nanomaterials for Strain Detection in Smart Concrete
Deep Learning Approaches to Evaluating ADHD Using EEG Data: RNN, GRU, and LSTM Models