{"title":"Fixed frequency range empirical wavelet transform based acoustic and entropy features for speech emotion recognition","authors":"Siba Prasad Mishra, Pankaj Warule, Suman Deb","doi":"10.1016/j.specom.2024.103148","DOIUrl":null,"url":null,"abstract":"<div><div>The primary goal of automated speech emotion recognition (SER) is to accurately and effectively identify each specific emotion conveyed in a speech signal utilizing machines such as computers and mobile devices. The widespread recognition of the popularity of SER among academics for three decades is mainly attributed to its broad application in practical scenarios. The utilization of SER has proven to be beneficial in various fields, such as medical intervention, bolstering safety strategies, conducting vigil functions, enhancing online search engines, enhancing road safety, managing customer relationships, strengthening the connection between machines and humans, and numerous other domains. Many researchers have used diverse methodologies, such as the integration of different attributes, the use of different feature selection techniques, and designed a hybrid or complex model using more than one classifier, to augment the effectiveness of emotion classification. In our study, we used a novel technique called the fixed frequency range empirical wavelet transform (FFREWT) filter bank decomposition method to extract the features, and then used those features to accurately identify each and every emotion in the speech signal. The FFREWT filter bank method segments the speech signal frame (SSF) into many sub-signals or modes. We used each FFREWT-based decomposed mode to get features like the mel frequency cepstral coefficient (MFCC), approximate entropy (ApEn), permutation entropy (PrEn), and increment entropy (IrEn). We then used the different combinations of the proposed FFREWT-based feature sets and the deep neural network (DNN) classifier to classify the speech emotion. Our proposed method helps to achieve an emotion classification accuracy of 89.35%, 84.69%, and 100% using the combinations of the proposed FFREWT-based feature (MFCC + ApEn + PrEn + IrEn) for the EMO-DB, EMOVO, and TESS datasets, respectively. Our experimental results were compared with the other methods, and we found that the proposed FFREWT-based feature combinations with a DNN classifier performed better than the state-of-the-art methods in SER.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"166 ","pages":"Article 103148"},"PeriodicalIF":2.4000,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Speech Communication","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167639324001195","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
Citations: 0
Abstract
The primary goal of automated speech emotion recognition (SER) is to accurately and effectively identify the emotion conveyed in a speech signal using machines such as computers and mobile devices. SER has remained popular among researchers for three decades, largely because of its broad practical applicability. It has proven beneficial in many fields, including medical intervention, safety and surveillance systems, online search engines, road safety, customer relationship management, human-machine interaction, and numerous other domains. Many researchers have adopted diverse strategies to improve emotion classification, such as combining different features, applying feature selection techniques, and designing hybrid models that use more than one classifier. In our study, we used a novel technique, the fixed frequency range empirical wavelet transform (FFREWT) filter bank decomposition method, to extract features, and then used those features to identify the emotion in the speech signal. The FFREWT filter bank decomposes each speech signal frame (SSF) into several sub-signals, or modes. From each decomposed mode, we extracted features such as mel frequency cepstral coefficients (MFCCs), approximate entropy (ApEn), permutation entropy (PrEn), and increment entropy (IrEn). We then classified speech emotion using different combinations of the proposed FFREWT-based feature sets with a deep neural network (DNN) classifier. The proposed method achieved emotion classification accuracies of 89.35%, 84.69%, and 100% with the combined FFREWT-based feature set (MFCC + ApEn + PrEn + IrEn) on the EMO-DB, EMOVO, and TESS datasets, respectively. We compared our experimental results with other methods and found that the proposed FFREWT-based feature combinations with a DNN classifier outperformed state-of-the-art SER methods.
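To make the pipeline described in the abstract concrete, the Python sketch below illustrates only the per-mode feature-extraction step, under stated assumptions: fixed_band_modes is a hypothetical stand-in for the FFREWT filter bank (the paper's decomposition is not a standard library routine), only permutation entropy is shown as a representative of the three entropy measures, and MFCCs are computed with librosa. This is a minimal illustration, not the authors' implementation.

import math

import librosa
import numpy as np
from scipy import signal


def permutation_entropy(x, order=3, delay=1):
    """Normalized permutation entropy (PrEn) of a 1-D signal."""
    n = len(x) - (order - 1) * delay
    # Map each embedded vector to its ordinal (ranking) pattern.
    patterns = np.array(
        [tuple(np.argsort(x[i:i + order * delay:delay])) for i in range(n)]
    )
    _, counts = np.unique(patterns, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p)) / math.log2(math.factorial(order))


def fixed_band_modes(frame, sr, n_modes=4):
    """Crude stand-in for the FFREWT filter bank: split the frame into
    n_modes equal-width frequency bands (not the paper's exact method)."""
    edges = np.linspace(0.0, sr / 2.0, n_modes + 1)
    modes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        lo = max(lo, 1.0)                  # keep band edges strictly inside (0, Nyquist)
        hi = min(hi, 0.99 * sr / 2.0)
        sos = signal.butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        modes.append(signal.sosfiltfilt(sos, frame))
    return modes


def frame_features(frame, sr, n_modes=4, n_mfcc=13):
    """Per-frame feature vector: mean MFCCs plus permutation entropy for each mode."""
    feats = []
    for mode in fixed_band_modes(frame, sr, n_modes):
        mfcc = librosa.feature.mfcc(y=mode, sr=sr, n_mfcc=n_mfcc)
        feats.extend(mfcc.mean(axis=1))          # average MFCCs over time
        feats.append(permutation_entropy(mode))  # one entropy value per mode
    return np.array(feats)


# Example on a synthetic 0.5 s segment at 16 kHz; a real pipeline would feed
# such per-frame vectors from a labeled corpus to a DNN classifier.
sr = 16000
segment = np.random.randn(sr // 2)
print(frame_features(segment, sr).shape)  # (n_modes * (n_mfcc + 1),) -> (56,)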
Journal introduction:
Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results.
The journal's primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.