{"title":"Speech emotion recognition using energy based adaptive mode selection","authors":"Ravi, Sachin Taran","doi":"10.1016/j.specom.2025.103228","DOIUrl":null,"url":null,"abstract":"<div><div>In this framework, a speech emotion recognition approach is presented, relying on Variational Mode Decomposition (VMD) and adaptive mode selection utilizing energy information. Instead of directly analyzing speech signals this work is focused on the preprocessing of raw speech signals. Initially, a given speech signal is decomposed using VMD and then the energy of each mode is calculated. Based on energy estimation, the dominant modes are selected for signal reconstruction. VMD combined with energy estimation improves the predictability of the reconstructed speech signal. The improvement in predictability is demonstrated using root mean square and spectral entropy measures. The reconstructed signal is divided into frames, and prosodic and spectral features are then calculated. Following feature extraction, ReliefF algorithm is utilized for the feature optimization. The resultant feature set is utilized to train the fine K- nearest neighbor classifier for emotion identification. The proposed framework was tested on publicly available acted and elicited datasets. For the acted datasets, the proposed framework achieved 93.8 %, 95.8 %, and 93.4 % accuracy on different language-based RAVDESS-speech, Emo-DB, and EMOVO datasets. Furthermore, the proposed method has also proven to be robust across three languages: English, German, and Italian, with language sensitivity as low as 2.4 % compared to existing methods. For the elicited dataset IEMOCAP, the proposed framework achieved the highest accuracy of 83.1 % compared to the existing state of the art.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"171 ","pages":"Article 103228"},"PeriodicalIF":3.0000,"publicationDate":"2025-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Speech Communication","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167639325000433","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
Citations: 0
Abstract
In this framework, a speech emotion recognition approach is presented, relying on Variational Mode Decomposition (VMD) and adaptive mode selection utilizing energy information. Rather than analyzing speech signals directly, this work focuses on the preprocessing of raw speech signals. Initially, a given speech signal is decomposed using VMD, and the energy of each mode is then calculated. Based on this energy estimation, the dominant modes are selected for signal reconstruction. VMD combined with energy estimation improves the predictability of the reconstructed speech signal; the improvement in predictability is demonstrated using root mean square and spectral entropy measures. The reconstructed signal is divided into frames, from which prosodic and spectral features are calculated. Following feature extraction, the ReliefF algorithm is applied for feature optimization. The resulting feature set is used to train a fine K-nearest neighbor classifier for emotion identification. The proposed framework was tested on publicly available acted and elicited datasets. For the acted datasets, the proposed framework achieved 93.8%, 95.8%, and 93.4% accuracy on the RAVDESS-speech, Emo-DB, and EMOVO datasets, which are based on different languages. Furthermore, the proposed method proved robust across three languages (English, German, and Italian), with language sensitivity as low as 2.4% compared to existing methods. For the elicited IEMOCAP dataset, the proposed framework achieved the highest accuracy of 83.1% relative to the existing state of the art.
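The abstract does not give implementation details, but the core preprocessing step it describes (VMD followed by energy-based mode selection and reconstruction) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the vmdpy package, the number of modes K, the VMD penalty alpha, and the 10% energy threshold are all assumptions made for the example.

```python
import numpy as np
from vmdpy import VMD  # third-party VMD implementation: pip install vmdpy

def reconstruct_from_dominant_modes(speech, K=6, alpha=2000, energy_ratio=0.10):
    """Decompose a speech signal with VMD and rebuild it from its high-energy modes."""
    # VMD(f, alpha, tau, K, DC, init, tol) returns the K modes, their spectra,
    # and the estimated center frequencies; only the modes are needed here.
    modes, _, _ = VMD(speech, alpha, 0.0, K, 0, 1, 1e-7)

    # Energy of each mode: sum of squared samples.
    energies = np.sum(modes ** 2, axis=1)

    # Keep modes whose energy exceeds a fraction of the total energy
    # (the 10% threshold is an illustrative assumption, not the paper's rule).
    dominant = energies >= energy_ratio * energies.sum()

    # Reconstruct the signal as the sum of the selected dominant modes.
    return modes[dominant].sum(axis=0)

# Usage (hypothetical): x is a 1-D float array holding a mono speech signal.
# x_reconstructed = reconstruct_from_dominant_modes(x)
```

The reconstructed signal would then be framed and passed to prosodic/spectral feature extraction, ReliefF feature selection, and a fine KNN classifier, as the abstract outlines.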
About the journal:
Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results.
The journal's primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.