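Stimulus Modality Matters: Impact of Perceptual Evaluations from Different Modalities on Speech Emotion Recognition System Performance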

arXiv - EE - Signal Processing, Pub Date: 2024-09-16, DOI: arxiv-2409.10762

Huang-Cheng Chou, Haibin Wu, Chi-Chun Lee


Citations: 0

Abstract

Speech Emotion Recognition (SER) systems rely on speech input and emotional labels annotated by humans. However, emotion databases collect perceptual evaluations in different ways. For instance, annotators of the IEMOCAP dataset are shown video clips with sound when providing their emotional perceptions, whereas raters of MSP-PODCAST, the most significant English emotion dataset, are given only speech when choosing emotional ratings. Regardless of how the labels were collected, using speech as input is the standard approach to training SER systems. The open question is therefore which stimulus scenario elicits the emotional labels that are most effective for training SER systems. We comprehensively compare the effectiveness of SER systems trained with labels elicited by stimuli of different modalities and evaluate these systems under various testing conditions. We also introduce an all-inclusive label that combines the labels elicited by all modalities. We show that using labels elicited by voice-only stimuli for training yields better performance on the test set.
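The abstract does not say how the all-inclusive label is built. As a rough illustration only, the sketch below pools categorical votes from every stimulus modality and normalizes them into one distribution. The emotion classes, the modality names (`voice_only`, `video_only`, `audio_visual`), the example votes, and the helper functions are all hypothetical, and pooling by simple vote counting is an assumption rather than the authors' method.

```python
from collections import Counter

# Hypothetical emotion classes; the abstract does not list the label set.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def soft_label(votes, classes=EMOTIONS):
    """Convert a list of categorical votes into a normalized distribution."""
    counts = Counter(votes)
    total = sum(counts[c] for c in classes)
    if total == 0:
        raise ValueError("no votes for any known class")
    return [counts[c] / total for c in classes]


def all_inclusive_label(votes_by_modality, classes=EMOTIONS):
    """Pool the votes from every stimulus modality, then normalize.

    Assumption: every vote counts equally, whatever stimulus elicited it.
    """
    pooled = [v for votes in votes_by_modality.values() for v in votes]
    return soft_label(pooled, classes)


# Hypothetical ratings for one utterance, grouped by the stimulus shown to raters.
ratings = {
    "voice_only": ["sad", "sad", "neutral"],
    "video_only": ["neutral", "neutral"],
    "audio_visual": ["sad", "neutral", "angry"],
}

voice_label = soft_label(ratings["voice_only"])
pooled_label = all_inclusive_label(ratings)
print("voice-only label:   ", voice_label)
print("all-inclusive label:", pooled_label)
```

In this toy case the voice-only raters favor "sad" while the pooled votes favor "neutral", which shows how the choice of stimulus modality can change the training target for the same utterance.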
