Using Speaker-Specific Emotion Representations in Wav2vec 2.0-Based Modules for Speech Emotion Recognition
Somin Park, Mpabulungi Mark, Bogyung Park, Hyunki Hong
Computers, Materials & Continua, 2023. DOI: 10.32604/cmc.2023.041332
Citations: 0
Abstract
Speech emotion recognition (SER) is essential for frictionless human-machine interaction, where machines respond to human instructions with context-aware actions. The properties of individuals' voices vary with culture, language, gender, and personality. These variations in speaker-specific properties may hamper the performance of standard representations in downstream tasks such as SER. This study demonstrates the significance of speaker-specific speech characteristics and shows how they can be leveraged to improve the performance of SER models. In the proposed approach, two wav2vec 2.0-based modules (a speaker-identification network and an emotion classification network) are trained with the ArcFace loss. The speaker-identification network uses a single attention block to encode an input audio waveform into a speaker-specific representation. The emotion classification network uses a wav2vec 2.0 backbone and four attention blocks to encode the same waveform into an emotion representation. The two representations are then fused into a single vector containing both emotion and speaker-specific information. Experimental results show that using speaker-specific characteristics improves SER performance, and that combining them with an angular margin loss such as ArcFace improves intra-class compactness while increasing inter-class separability, as demonstrated by t-distributed stochastic neighbor embedding (t-SNE) plots. The proposed approach outperforms previous methods that use similar training strategies, achieving a weighted accuracy (WA) of 72.14% and an unweighted accuracy (UA) of 72.97% on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset, demonstrating its effectiveness and its potential to enhance human-machine interaction through more accurate emotion recognition in speech.
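The pipeline the abstract describes — encode the waveform into emotion and speaker-specific representations, fuse them into one vector, and classify under an ArcFace-style angular margin — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the fusion-by-concatenation, the shapes, and the scale/margin hyperparameters (`s`, `m`) are all assumptions for demonstration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Project vectors onto the unit hypersphere, as ArcFace requires."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def fuse(emotion_repr, speaker_repr):
    """Fuse the two module outputs into a single vector.
    Concatenation is an assumption; the paper's fusion may differ."""
    return np.concatenate([emotion_repr, speaker_repr], axis=-1)

def arcface_logits(embeddings, class_weights, labels, s=30.0, m=0.5):
    """ArcFace-style logits: add an angular margin m to the angle between
    each embedding and its target-class weight, then scale by s."""
    emb = l2_normalize(embeddings)            # (N, D), unit rows
    w = l2_normalize(class_weights, axis=0)   # (D, C), unit columns
    cos = np.clip(emb @ w, -1.0, 1.0)         # cosine similarities, (N, C)
    theta = np.arccos(cos)                    # angles in [0, pi]
    out = np.copy(cos)
    rows = np.arange(len(labels))
    # Penalize only the target class: cos(theta + m) < cos(theta),
    # which forces tighter intra-class clusters during training.
    out[rows, labels] = np.cos(theta[rows, labels] + m)
    return s * out

# Toy batch: 4 utterances, 8-dim emotion and speaker representations.
rng = np.random.default_rng(0)
emo = rng.normal(size=(4, 8))     # stand-in for the emotion module output
spk = rng.normal(size=(4, 8))     # stand-in for the speaker module output
fused = fuse(emo, spk)            # (4, 16) fused representation
labels = np.array([0, 1, 2, 3])   # one emotion class per utterance
W = rng.normal(size=(16, 4))      # stand-in classifier weights, 4 classes
logits = arcface_logits(fused, W, labels)
```

The key property the abstract attributes to the ArcFace loss is visible here: the margin lowers the target-class logit relative to a plain cosine classifier, so the network must pull same-class embeddings closer together on the hypersphere to compensate.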