What Does it Take to Generalize SER Model Across Datasets? A Comprehensive Benchmark
Adham Ibrahim, Shady Shehata, Ajinkya Kulkarni, Mukhtar Mohamed, Muhammad Abdul-Mageed
arXiv - CS - Sound, 2024-06-14, doi: arxiv-2406.09933
Abstract
Speech emotion recognition (SER) is essential for enhancing human-computer
interaction in speech-based applications. Although performance on individual
emotion datasets has improved, a research gap remains in how well SER
generalizes to real-world conditions. In this paper, we investigate
approaches for generalizing an SER system across different emotion datasets.
In particular, we incorporate 11 emotional speech datasets and present a
comprehensive benchmark on the SER task. We also use over-sampling methods to
address the imbalanced data distribution that arises when SER datasets are
combined for training. Furthermore, we explore several evaluation protocols
for assessing how well SER generalizes. Building on this, we explore the
potential of Whisper for SER, emphasizing the importance of thorough
evaluation. Our approach is designed to advance SER technology by integrating
speaker-independent methods.
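
The abstract does not say which over-sampling method or split procedure is used, so the following is only a minimal sketch under assumptions: plain random over-sampling (duplicating minority-class utterances) and a split that keeps each speaker entirely in either train or test. The record fields `dataset`, `speaker`, and `label`, the function names, and the toy data are all hypothetical and not taken from the paper.

```python
import random
from collections import Counter, defaultdict


def speaker_independent_split(samples, test_fraction=0.2, seed=0):
    """Split so that no speaker appears in both train and test.

    Assumption: speaker IDs are already unique across corpora
    (e.g. prefixed with the dataset name).
    """
    rng = random.Random(seed)
    speakers = sorted({s["speaker"] for s in samples})
    rng.shuffle(speakers)
    n_test = max(1, int(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [s for s in samples if s["speaker"] not in test_speakers]
    test = [s for s in samples if s["speaker"] in test_speakers]
    return train, test


def random_oversample(samples, seed=0):
    """Duplicate randomly chosen minority-class samples until every
    emotion class is as large as the largest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample["label"]].append(sample)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing samples with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: two merged corpora with skewed emotion labels.
toy = (
    [{"dataset": "A", "speaker": f"A_{i % 4}", "label": "neutral"} for i in range(8)]
    + [{"dataset": "B", "speaker": f"B_{i % 2}", "label": "angry"} for i in range(3)]
    + [{"dataset": "B", "speaker": f"B_{i % 2}", "label": "sad"} for i in range(2)]
)

train, test = speaker_independent_split(toy)
balanced_train = random_oversample(train)
print(Counter(s["label"] for s in balanced_train))
```

In this sketch, over-sampling is applied only to the training split, after the speaker-level split, so that duplicated utterances cannot leak into the test set. Whether the paper follows this order is not stated in the abstract.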