Multimodal Emotion Recognition Using Contextualized Audio Information and Ground Transcripts on Multiple Datasets

Arabian Journal for Science and Engineering, 49(9): 11871–11881 | Impact Factor 2.9 | CAS Zone 4 (Multidisciplinary) | JCR Q2, Multidisciplinary Sciences | Published: 2023-11-07 | DOI: 10.1007/s13369-023-08395-3
Krishna Chauhan, Kamalesh Kumar Sharma, Tarun Varma
{"title":"Multimodal Emotion Recognition Using Contextualized Audio Information and Ground Transcripts on Multiple Datasets","authors":"Krishna Chauhan,&nbsp;Kamalesh Kumar Sharma,&nbsp;Tarun Varma","doi":"10.1007/s13369-023-08395-3","DOIUrl":null,"url":null,"abstract":"<div><p>The widespread applications of emotion recognition (ER) in various fields have recently attracted much attention from researchers. Consequently, an array of advanced techniques has emerged, driven by enhancing the accuracy and robustness of these recognition systems. As emotional dialogue comprises sound and spoken content, the proposed model encodes the information from audio and text sequences using two separate channels and merges them for emotional classification. The two channels used inputs from audio and text modalities. The audio channel is encoded using a deep convolutional neural network with residual connections and further transformed using a self-attention-based multihead attention network called channel-wise global head pooling. Unlike the vanilla multihead attention network, an adaptive global pooling is used after concatenating all the heads. The text channel is encoded using a pre-trained BERT model. The proposed ER method is validated on four benchmark databases: Interactive Emotional Dyadic Motion Capture in English, the Berlin emotional speech dataset in the German language, Ryerson Audio-Visual Database of Emotional Speech and Song in North American English and Crowd-sourced Emotional Multimodal Actors Dataset in English. The classification accuracy on the above emotional corpora is 85.71%, 79.52%, 76.71% and 73.91%, respectively. Furthermore, cross-corpus analysis is presented to understand the variability of speech and text features and the robustness of the proposed architecture.</p></div>","PeriodicalId":54354,"journal":{"name":"Arabian Journal for Science and Engineering","volume":"49 9","pages":"11871 - 11881"},"PeriodicalIF":2.9000,"publicationDate":"2023-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Arabian Journal for Science and Engineering","FirstCategoryId":"103","ListUrlMain":"https://link.springer.com/article/10.1007/s13369-023-08395-3","RegionNum":4,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
Citations: 0

Abstract

The widespread applications of emotion recognition (ER) in various fields have recently attracted much attention from researchers. Consequently, an array of advanced techniques has emerged, driven by the goal of enhancing the accuracy and robustness of these recognition systems. As emotional dialogue comprises both sound and spoken content, the proposed model encodes the information from audio and text sequences using two separate channels and merges them for emotion classification. The two channels take inputs from the audio and text modalities. The audio channel is encoded using a deep convolutional neural network with residual connections and further transformed using a self-attention-based multihead attention network called channel-wise global head pooling. Unlike the vanilla multihead attention network, an adaptive global pooling is applied after concatenating all the heads. The text channel is encoded using a pre-trained BERT model. The proposed ER method is validated on four benchmark databases: the Interactive Emotional Dyadic Motion Capture database in English, the Berlin emotional speech dataset in German, the Ryerson Audio-Visual Database of Emotional Speech and Song in North American English, and the Crowd-sourced Emotional Multimodal Actors Dataset in English. The classification accuracies on these emotional corpora are 85.71%, 79.52%, 76.71% and 73.91%, respectively. Furthermore, a cross-corpus analysis is presented to examine the variability of speech and text features and the robustness of the proposed architecture.
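
The abstract gives no implementation details, but the dual-channel design it describes (a residual CNN plus multihead self-attention with global pooling for the audio channel, a BERT encoder for the text channel, and fusion for classification) can be sketched roughly as below. This is a minimal PyTorch sketch under stated assumptions: the layer sizes, number of attention heads, pooling choice, and number of emotion classes are illustrative only, and the text branch consumes precomputed 768-dimensional BERT sentence embeddings rather than running BERT end to end. None of these choices are taken from the paper itself.

```python
# Illustrative sketch only; hyperparameters and the exact "channel-wise global
# head pooling" design are assumptions, not the authors' published configuration.
import torch
import torch.nn as nn


class ResidualConvBlock(nn.Module):
    """1-D convolutional block with a residual (skip) connection for the audio channel."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.conv(x))  # residual connection


class AudioChannel(nn.Module):
    """Residual CNN followed by multihead self-attention, then a global pooling
    applied after the attention heads are concatenated (an approximation of the
    paper's channel-wise global head pooling)."""

    def __init__(self, n_mels: int = 64, channels: int = 128, heads: int = 4):
        super().__init__()
        self.stem = nn.Conv1d(n_mels, channels, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(*[ResidualConvBlock(channels) for _ in range(3)])
        self.attn = nn.MultiheadAttention(channels, num_heads=heads, batch_first=True)
        self.pool = nn.AdaptiveAvgPool1d(1)  # adaptive global pooling over time

    def forward(self, mel):                      # mel: (batch, n_mels, time)
        h = self.blocks(self.stem(mel))           # (batch, channels, time)
        h = h.transpose(1, 2)                     # (batch, time, channels)
        h, _ = self.attn(h, h, h)                 # self-attention; heads concatenated internally
        return self.pool(h.transpose(1, 2)).squeeze(-1)  # (batch, channels)


class MultimodalER(nn.Module):
    """Fusion of the audio embedding and a (precomputed) BERT sentence embedding."""

    def __init__(self, bert_dim: int = 768, audio_dim: int = 128, n_emotions: int = 4):
        super().__init__()
        self.audio = AudioChannel(channels=audio_dim)
        self.text_proj = nn.Linear(bert_dim, audio_dim)   # stand-in for the BERT text channel
        self.classifier = nn.Linear(2 * audio_dim, n_emotions)

    def forward(self, mel, bert_cls):
        fused = torch.cat([self.audio(mel), self.text_proj(bert_cls)], dim=-1)
        return self.classifier(fused)              # emotion logits


# Example forward pass with random tensors standing in for real features.
model = MultimodalER()
logits = model(torch.randn(2, 64, 300), torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 4])
```

The random tensors at the end stand in for a batch of two log-mel spectrograms (64 mel bands by 300 frames) and their corresponding BERT sentence embeddings; in practice these would come from an acoustic front end and a pre-trained BERT encoder.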

Source Journal
Arabian Journal for Science and Engineering
CiteScore: 5.70
Self-citation rate: 3.40%
Articles published: 993
Journal description: King Fahd University of Petroleum & Minerals (KFUPM) partnered with Springer to publish the Arabian Journal for Science and Engineering (AJSE). AJSE, which has been published by KFUPM since 1975, is a recognized national, regional and international journal that provides a great opportunity for the dissemination of research advances from the Kingdom of Saudi Arabia, MENA and the world.
Latest articles in this journal:
Generative Adversarial Networks for Intrusion Detection Systems: A Comprehensive Survey of Applications, Challenges, and Research Directions
Breathing Cycle Detection for Respiratory Tele-health Systems
Enhanced Voltammetric Detection of Selected Antibiotic Residues in Dairy Products Utilizing Iron-doped Zeolite-Modified Carbon Paste Electrode
Graphene-Based Nanomaterials for Strain Detection in Smart Concrete
Deep Learning Approaches to Evaluating ADHD Using EEG Data: RNN, GRU, and LSTM Models