Emotion Recognition from Speech via the Use of Different Audio Features, Machine Learning and Deep Learning Algorithms

Alperen Sayar, Tuna Çakar, Tunahan Bozkan, Seyit Ertugrul, Fatma Gümüş
{"title":"通过使用不同的音频特征,机器学习和深度学习算法从语音中识别情感","authors":"Alperen Sayar, Tuna Çakar, Tunahan Bozkan, Seyit Ertugrul, Fatma Gümüş","doi":"10.54941/ahfe1003279","DOIUrl":null,"url":null,"abstract":"Speech has been accepted as one of the basic, efficient and powerful communication methods. At the beginning of the 20th century, electroacoustic analysis was used for determining emotions in psychology. In academics, Speech Emotion Recognition (SER) has become one of the most studied and investigated research areas. This research program aims to determine the emotional state of the speaker based on speech signals. Significant studies have been undertaken during the last two decades to identify emotions from speech by using machine learning. However, it is still a challenging task because emotions rotate from one to another and there are environmental factors which have significant effects on emotions. Furthermore, sound consists of numerous parameters and there are various anatomical characteristics to take into consideration. Determining an appropriate audio feature set for emotion recognition is still a critical decision point for an emotion recognition system. The demand for voice technology in both art and human – machine interaction systems has recently been increased. Our voice conveys both linguistic and paralinguistic messages in the course of speaking. The paralinguistic part, for example, rhythm and pitch, provides emotional cues to the speaker. The speech emotion recognition topic examines the question ‘How is it said?’ and an algorithm detects the emotional state of the speaker from an audio record. Although a considerable number of the studies have been conducted for selecting and extracting an optimal set of features, appropriate attributes for automatic emotion recognition from audio are still under research. The main aim of this study is obtaining the most distinctive emotional audio features. For this purpose, time- based features, frequency-based features and spectral shape-based features are used for comparing recognition accuracies. Besides these features, a pre-trained model is used for obtaining input for emotion recognition. Machine learning models are developed for classifying emotions with Support Vector Machine, Multi-Layer Perceptron and Convolutional Neural Network algorithms. Three emotional databases in English and German are combined and a larger database is obtained for training and testing the models. Emotions namely, Happy, Calm, Angry, Boredom, Disgust, Fear, Neutral, Sad and Surprised are classified with these models. When the classification results are examined, it is concluded that the pre- trained representations make the most successful predictions. The weighted accuracy ratio is 91% for both Convolutional Neural Network and Multilayer Perceptron algorithms while this ratio is 87% for the Support Vector Machine algorithm. A hybrid model is being developed which contains both a pre-trained model and spectral shaped based features. Speech contains silent and noisy sections which increase the computational complexity. Time performance is the other major factor which should be a great deal of careful consideration. Although there are many advancements on SER, custom architectures are designed to fuse accuracy and time performance. 
Even further for a more realistic emotion estimation all physical gestures like voice, body parts of movement and facial expression can be obtained together as humans use them collectively to express themselves.","PeriodicalId":405313,"journal":{"name":"Artificial Intelligence and Social Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Emotion Recognition from Speech via the Use of Different Audio Features, Machine Learning and Deep Learning Algorithms\",\"authors\":\"Alperen Sayar, Tuna Çakar, Tunahan Bozkan, Seyit Ertugrul, Fatma Gümüş\",\"doi\":\"10.54941/ahfe1003279\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Speech has been accepted as one of the basic, efficient and powerful communication methods. At the beginning of the 20th century, electroacoustic analysis was used for determining emotions in psychology. In academics, Speech Emotion Recognition (SER) has become one of the most studied and investigated research areas. This research program aims to determine the emotional state of the speaker based on speech signals. Significant studies have been undertaken during the last two decades to identify emotions from speech by using machine learning. However, it is still a challenging task because emotions rotate from one to another and there are environmental factors which have significant effects on emotions. Furthermore, sound consists of numerous parameters and there are various anatomical characteristics to take into consideration. Determining an appropriate audio feature set for emotion recognition is still a critical decision point for an emotion recognition system. The demand for voice technology in both art and human – machine interaction systems has recently been increased. Our voice conveys both linguistic and paralinguistic messages in the course of speaking. The paralinguistic part, for example, rhythm and pitch, provides emotional cues to the speaker. The speech emotion recognition topic examines the question ‘How is it said?’ and an algorithm detects the emotional state of the speaker from an audio record. Although a considerable number of the studies have been conducted for selecting and extracting an optimal set of features, appropriate attributes for automatic emotion recognition from audio are still under research. The main aim of this study is obtaining the most distinctive emotional audio features. For this purpose, time- based features, frequency-based features and spectral shape-based features are used for comparing recognition accuracies. Besides these features, a pre-trained model is used for obtaining input for emotion recognition. Machine learning models are developed for classifying emotions with Support Vector Machine, Multi-Layer Perceptron and Convolutional Neural Network algorithms. Three emotional databases in English and German are combined and a larger database is obtained for training and testing the models. Emotions namely, Happy, Calm, Angry, Boredom, Disgust, Fear, Neutral, Sad and Surprised are classified with these models. When the classification results are examined, it is concluded that the pre- trained representations make the most successful predictions. The weighted accuracy ratio is 91% for both Convolutional Neural Network and Multilayer Perceptron algorithms while this ratio is 87% for the Support Vector Machine algorithm. 
A hybrid model is being developed which contains both a pre-trained model and spectral shaped based features. Speech contains silent and noisy sections which increase the computational complexity. Time performance is the other major factor which should be a great deal of careful consideration. Although there are many advancements on SER, custom architectures are designed to fuse accuracy and time performance. Even further for a more realistic emotion estimation all physical gestures like voice, body parts of movement and facial expression can be obtained together as humans use them collectively to express themselves.\",\"PeriodicalId\":405313,\"journal\":{\"name\":\"Artificial Intelligence and Social Computing\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1900-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Artificial Intelligence and Social Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.54941/ahfe1003279\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Artificial Intelligence and Social Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.54941/ahfe1003279","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Speech is accepted as one of the most basic, efficient and powerful methods of communication. At the beginning of the 20th century, electroacoustic analysis was already being used in psychology to determine emotions. Speech Emotion Recognition (SER), which aims to determine the emotional state of a speaker from speech signals, has since become one of the most intensively studied research areas. Significant work has been undertaken over the last two decades to identify emotions from speech using machine learning. It remains a challenging task, however: emotions shift from one to another, environmental factors strongly influence them, sound is described by numerous parameters, and speakers differ in anatomical characteristics. Determining an appropriate audio feature set is therefore still a critical decision point for an emotion recognition system.

Demand for voice technology in both art and human-machine interaction systems has recently increased. While speaking, our voice conveys both linguistic and paralinguistic messages; the paralinguistic part, for example rhythm and pitch, provides emotional cues about the speaker. Speech emotion recognition examines the question "How is it said?", with an algorithm detecting the emotional state of the speaker from an audio recording. Although a considerable number of studies have addressed selecting and extracting an optimal feature set, the attributes best suited to automatic emotion recognition from audio are still under research.

The main aim of this study is to obtain the most distinctive emotional audio features. To this end, time-based, frequency-based and spectral shape-based features are compared in terms of recognition accuracy. Besides these hand-crafted features, a pre-trained model is used to obtain input representations for emotion recognition. Machine learning models are developed to classify emotions with Support Vector Machine (SVM), Multi-Layer Perceptron (MLP) and Convolutional Neural Network (CNN) algorithms. Three emotional databases in English and German are combined into a larger database for training and testing the models, and nine emotions are classified: Happy, Calm, Angry, Boredom, Disgust, Fear, Neutral, Sad and Surprised.

Examination of the classification results shows that the pre-trained representations make the most successful predictions: the weighted accuracy ratio is 91% for both the CNN and MLP algorithms, and 87% for the SVM algorithm. A hybrid model is being developed that combines the pre-trained model with spectral shape-based features. Speech also contains silent and noisy sections that increase computational complexity, and time performance is another factor that requires careful consideration; despite the many advances in SER, custom architectures are designed to balance accuracy and time performance. Further still, for more realistic emotion estimation, all physical signals such as the voice, body movements and facial expressions can be captured together, since humans use them collectively to express themselves.
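The abstract contrasts three hand-crafted feature families but does not list the exact features used. As a concrete illustration, the sketch below computes one representative of each family with the librosa library; the feature choices and parameters are assumptions for illustration, not the authors' configuration.

```python
# Minimal sketch of the three feature families named in the abstract,
# using librosa (an assumed toolkit; the paper does not name one).
import numpy as np
import librosa

def extract_features(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=16000)

    # Time-based features: computed directly on the waveform.
    zcr = librosa.feature.zero_crossing_rate(y).mean()
    rms = librosa.feature.rms(y=y).mean()

    # Frequency-based features: e.g. MFCCs, summarised over time.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

    # Spectral shape-based features: describe the spectral envelope.
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr).mean()
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr).mean()

    return np.hstack([zcr, rms, mfcc, centroid, bandwidth, rolloff])
```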
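The pre-trained representations that scored best (91% weighted accuracy) are not identified in the abstract. A minimal sketch of the general idea, assuming a wav2vec 2.0 encoder from Hugging Face transformers purely as a stand-in:

```python
# The paper does not name its pre-trained model; wav2vec 2.0 is an
# assumed stand-in to show how an utterance-level embedding is obtained.
import torch
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

def pretrained_embedding(y, sr=16000):
    inputs = extractor(y, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, frames, 768)
    # Mean-pool frame vectors into one fixed-size utterance vector,
    # usable as classifier input in place of hand-crafted features.
    return hidden.mean(dim=1).squeeze(0)
```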
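Of the three classifier families named, the SVM and MLP can be sketched with scikit-learn as below; the CNN is omitted. Nothing here reproduces the authors' architectures or their 87%/91% figures; it only shows the training-and-scoring shape such an experiment takes, with hyperparameters chosen arbitrarily.

```python
# Hedged sketch of the SVM and MLP classifiers on feature vectors such
# as extract_features() or pretrained_embedding() returns; X is the
# feature matrix, y the nine emotion labels.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import balanced_accuracy_score

def train_and_score(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    models = {
        "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0)),
        "MLP": make_pipeline(StandardScaler(),
                             MLPClassifier(hidden_layer_sizes=(256, 128),
                                           max_iter=500, random_state=0)),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        # Balanced accuracy, comparable in spirit to the paper's
        # weighted accuracy ratio, on the held-out split.
        print(name, balanced_accuracy_score(y_te, model.predict(X_te)))
```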
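For the silent and noisy sections the abstract flags as a computational burden, a common remedy is energy-based trimming. A minimal sketch with librosa, where the 25 dB threshold is an assumed value, not one from the paper:

```python
# Illustrative pre-processing for the silence problem: trim() drops
# leading/trailing silence, split() keeps only voiced intervals.
import numpy as np
import librosa

def strip_silence(y: np.ndarray, top_db: float = 25.0) -> np.ndarray:
    y, _ = librosa.effects.trim(y, top_db=top_db)        # trim the edges
    intervals = librosa.effects.split(y, top_db=top_db)  # voiced runs
    return np.concatenate([y[s:e] for s, e in intervals])
```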