{"title":"Speech emotion recognition based on Gaussian Mixture Models and Deep Neural Networks","authors":"I. Tashev, Zhong-Qiu Wang, Keith W. Godin","doi":"10.1109/ITA.2017.8023477","DOIUrl":null,"url":null,"abstract":"Recognition of speaker emotion during interaction in spoken dialog systems can enhance the user experience, and provide system operators with information valuable to ongoing assessment of interaction system performance and utility. Interaction utterances are very short, and we assume the speaker's emotion is constant throughout a given utterance. This paper investigates combinations of a GMM-based low-level feature extractor with a neural network serving as a high level feature extractor. The advantage of this system architecture is that it combines the fast developing neural network-based solutions with the classic statistical approaches applied to emotion recognition. Experiments on a Mandarin data set compare different solutions under the same or close conditions.","PeriodicalId":305510,"journal":{"name":"2017 Information Theory and Applications Workshop (ITA)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"24","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 Information Theory and Applications Workshop (ITA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ITA.2017.8023477","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 24
Abstract
Recognition of speaker emotion during interaction in spoken dialog systems can enhance the user experience, and provide system operators with information valuable to ongoing assessment of interaction system performance and utility. Interaction utterances are very short, and we assume the speaker's emotion is constant throughout a given utterance. This paper investigates combinations of a GMM-based low-level feature extractor with a neural network serving as a high level feature extractor. The advantage of this system architecture is that it combines the fast developing neural network-based solutions with the classic statistical approaches applied to emotion recognition. Experiments on a Mandarin data set compare different solutions under the same or close conditions.