Using Speaker-Specific Emotion Representations in Wav2vec 2.0-Based Modules for Speech Emotion Recognition

Somin Park, Mpabulungi Mark, Bogyung Park, Hyunki Hong
{"title":"Using Speaker-Specific Emotion Representations in Wav2vec 2.0-Based Modules for Speech Emotion Recognition","authors":"Somin Park, Mpabulungi Mark, Bogyung Park, Hyunki Hong","doi":"10.32604/cmc.2023.041332","DOIUrl":null,"url":null,"abstract":"Speech emotion recognition is essential for frictionless human-machine interaction, where machines respond to human instructions with context-aware actions. The properties of individuals’ voices vary with culture, language, gender, and personality. These variations in speaker-specific properties may hamper the performance of standard representations in downstream tasks such as speech emotion recognition (SER). This study demonstrates the significance of speaker-specific speech characteristics and how considering them can be leveraged to improve the performance of SER models. In the proposed approach, two wav2vec-based modules (a speaker-identification network and an emotion classification network) are trained with the Arcface loss. The speaker-identification network has a single attention block to encode an input audio waveform into a speaker-specific representation. The emotion classification network uses a wav2vec 2.0-backbone as well as four attention blocks to encode the same input audio waveform into an emotion representation. These two representations are then fused into a single vector representation containing emotion and speaker-specific information. Experimental results showed that the use of speaker-specific characteristics improves SER performance. Additionally, combining these with an angular marginal loss such as the Arcface loss improves intra-class compactness while increasing inter-class separability, as demonstrated by the plots of t-distributed stochastic neighbor embeddings (t-SNE). The proposed approach outperforms previous methods using similar training strategies, with a weighted accuracy (WA) of 72.14% and unweighted accuracy (UA) of 72.97% on the Interactive Emotional Dynamic Motion Capture (IEMOCAP) dataset. This demonstrates its effectiveness and potential to enhance human-machine interaction through more accurate emotion recognition in speech.","PeriodicalId":93535,"journal":{"name":"Computers, materials & continua","volume":"63 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computers, materials & continua","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.32604/cmc.2023.041332","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Speech emotion recognition is essential for frictionless human-machine interaction, where machines respond to human instructions with context-aware actions. The properties of individuals’ voices vary with culture, language, gender, and personality. These variations in speaker-specific properties may hamper the performance of standard representations in downstream tasks such as speech emotion recognition (SER). This study demonstrates the significance of speaker-specific speech characteristics and how they can be leveraged to improve the performance of SER models. In the proposed approach, two wav2vec 2.0-based modules (a speaker-identification network and an emotion classification network) are trained with the Arcface loss. The speaker-identification network has a single attention block to encode an input audio waveform into a speaker-specific representation. The emotion classification network uses a wav2vec 2.0 backbone as well as four attention blocks to encode the same input audio waveform into an emotion representation. These two representations are then fused into a single vector representation containing both emotion and speaker-specific information. Experimental results showed that the use of speaker-specific characteristics improves SER performance. Additionally, combining these with an angular margin loss such as the Arcface loss improves intra-class compactness while increasing inter-class separability, as demonstrated by plots of t-distributed stochastic neighbor embeddings (t-SNE). The proposed approach outperforms previous methods using similar training strategies, with a weighted accuracy (WA) of 72.14% and an unweighted accuracy (UA) of 72.97% on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. This demonstrates its effectiveness and its potential to enhance human-machine interaction through more accurate emotion recognition in speech.
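
The abstract gives only a high-level description of the architecture, but the two-branch idea it outlines can be sketched as follows. This is a minimal PyTorch sketch for illustration only: the `TwoBranchSER` and `ArcFaceHead` names, the GRU encoders standing in for the wav2vec 2.0-based modules, the embedding sizes, and fusion by simple concatenation are assumptions, not the authors' implementation.

```python
# Minimal sketch of a speaker branch and an emotion branch whose embeddings are
# fused for SER, with ArcFace-style (additive angular margin) classification heads.
# Placeholder encoders and dimensions are assumptions; the paper's wav2vec 2.0
# backbone and attention blocks are not reproduced here.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ArcFaceHead(nn.Module):
    """Additive angular margin classifier, as used for both branches in the paper."""

    def __init__(self, in_features: int, num_classes: int,
                 margin: float = 0.5, scale: float = 30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.margin = margin
        self.scale = scale

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalised embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # Add the angular margin only to the target-class logits.
        target = F.one_hot(labels, num_classes=cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cosine)
        return self.scale * logits  # pass to cross-entropy


class TwoBranchSER(nn.Module):
    """Speaker-specific and emotion embeddings fused into one SER representation."""

    def __init__(self, feat_dim: int = 768, emb_dim: int = 256,
                 num_speakers: int = 10, num_emotions: int = 4):
        super().__init__()
        # GRUs used as stand-ins for the wav2vec 2.0-based encoder modules.
        self.speaker_encoder = nn.GRU(feat_dim, emb_dim, batch_first=True)
        self.emotion_encoder = nn.GRU(feat_dim, emb_dim, batch_first=True)
        self.speaker_head = ArcFaceHead(emb_dim, num_speakers)
        self.emotion_head = ArcFaceHead(2 * emb_dim, num_emotions)

    def forward(self, frames, speaker_labels, emotion_labels):
        # frames: (batch, time, feat_dim) acoustic features of the same utterance.
        _, spk = self.speaker_encoder(frames)   # speaker-specific representation
        _, emo = self.emotion_encoder(frames)   # emotion representation
        spk, emo = spk.squeeze(0), emo.squeeze(0)
        fused = torch.cat([spk, emo], dim=-1)   # assumed fusion: concatenation
        spk_logits = self.speaker_head(spk, speaker_labels)
        emo_logits = self.emotion_head(fused, emotion_labels)
        return (F.cross_entropy(spk_logits, speaker_labels)
                + F.cross_entropy(emo_logits, emotion_labels))


if __name__ == "__main__":
    model = TwoBranchSER()
    frames = torch.randn(8, 200, 768)  # e.g. frame-level features from wav2vec 2.0
    loss = model(frames, torch.randint(0, 10, (8,)), torch.randint(0, 4, (8,)))
    print(loss.item())
```

The margin added to the target-class angle is what encourages the intra-class compactness and inter-class separability that the abstract reports observing in the t-SNE plots.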