Optimizing Speech Emotion Recognition with Machine Learning Based Advanced Audio Cue Analysis
Nuwan Pallewela, D. Alahakoon, A. Adikari, John E. Pierce, ML Rose
Technologies (journal article), published 2024-07-11. DOI: 10.3390/technologies12070111 (https://doi.org/10.3390/technologies12070111)
Abstract
Human–computer interaction is now an integral part of daily life, and the ability to recognize and understand human emotions has become a crucial facet of technological advancement. Human emotion, however, is a complex interplay of physiological, psychological, and social factors, and it is difficult even for other humans to interpret accurately. With the emergence of voice assistants and other speech-based applications, it has become essential to improve the recognition of emotion expressed in audio. Current emotion annotation practice lacks specificity and agreement, as evidenced by conflicting labels for the same speech segments in many human-annotated emotion datasets. Previous studies have had to filter out these conflicting segments, so a large portion of the collected data has been treated as unusable. In this study, we aimed to improve the accuracy of computational prediction of uncertain emotion labels by using speech segments with high-confidence emotion labels from the IEMOCAP emotion dataset. We implemented an audio-based emotion recognition model that uses bag-of-audio-words (BoAW) encoding to represent the audio aspects of emotion in speech, combined with state-of-the-art recurrent neural network models. Our approach achieved an accuracy of 61.09%, improving on the state of the art in audio-based emotion recognition by 1.02% over the BiDialogueRNN model and by 1.72% over the EmoCaps multi-modal emotion recognition models. Compared with human annotation, our approach achieved similar results in identifying positive and negative emotions. It also proved effective in recognizing the sentiment of uncertain emotion segments that other studies had considered unusable. Improvements in audio emotion recognition could have implications for voice-based assistants, healthcare, and other industrial applications that benefit from automated communication.
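The bag-of-audio-words encoding mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the two-dimensional frame features, and the hand-picked codebook are all hypothetical, chosen only to show the general idea. Each frame-level feature vector is assigned to its nearest codebook entry (an "audio word"), and the utterance is represented by a normalized histogram of those assignments. In practice the codebook would typically be learned from training data, for example by clustering, and the resulting fixed-length vectors would be passed to a classifier such as a recurrent network, which is not shown here.

```python
from math import dist


def nearest_word(frame, codebook):
    """Return the index of the codebook entry closest to `frame` (Euclidean)."""
    return min(range(len(codebook)), key=lambda i: dist(frame, codebook[i]))


def encode_boaw(frames, codebook):
    """Encode a variable-length sequence of frame features as a fixed-length,
    normalized histogram of audio-word counts."""
    counts = [0] * len(codebook)
    for frame in frames:
        counts[nearest_word(frame, codebook)] += 1
    total = len(frames)
    # An empty utterance maps to an all-zero histogram.
    return [c / total for c in counts] if total else [0.0] * len(codebook)


# Hypothetical codebook of three audio words over 2-D frame features
# (assumption: features are already extracted and scaled).
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

# Hypothetical utterance consisting of four frames.
utterance = [(0.1, 0.0), (0.9, 1.1), (0.8, 0.9), (0.1, 0.9)]

histogram = encode_boaw(utterance, codebook)
print(histogram)  # [0.25, 0.5, 0.25]
```

The histogram has one entry per audio word regardless of how many frames the utterance contains, which is what lets speech segments of different durations share a single input size.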