Audio-Visual Emotion Recognition Using K-Means Clustering and Spatio-Temporal CNN

Masoumeh Sharafi, M. Yazdchi, J. Rasti
{"title":"基于k均值聚类和时空CNN的视听情感识别","authors":"Masoumeh Sharafi, M. Yazdchi, J. Rasti","doi":"10.1109/IPRIA59240.2023.10147192","DOIUrl":null,"url":null,"abstract":"Emotion recognition is a challenging task due to the emotional gap between subjective feeling and low-level audio-visual characteristics. Thus, the development of a feasible approach for high-performance emotion recognition might enhance human-computer interaction. Deep learning methods have enhanced the performance of emotion recognition systems in comparison to other current methods. In this paper, a multimodal deep convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM) network are proposed, which fuses the audio and visual cues in a deep model. The spatial and temporal features extracted from video frames are fused with short-term Fourier transform (STFT) extracted from audio signals. Finally, a Softmax classifier is used to classify inputs into seven groups: anger, disgust, fear, happiness, sadness, surprise, and neutral mode. The proposed model is evaluated on Surrey Audio-Visual Expressed Emotion (SAVEE) database with an accuracy of 95.48%. Our experimental study reveals that the suggested method is more effective than existing algorithms in adapting to emotion recognition in this dataset.","PeriodicalId":109390,"journal":{"name":"2023 6th International Conference on Pattern Recognition and Image Analysis (IPRIA)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Audio-Visual Emotion Recognition Using K-Means Clustering and Spatio-Temporal CNN\",\"authors\":\"Masoumeh Sharafi, M. Yazdchi, J. Rasti\",\"doi\":\"10.1109/IPRIA59240.2023.10147192\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Emotion recognition is a challenging task due to the emotional gap between subjective feeling and low-level audio-visual characteristics. Thus, the development of a feasible approach for high-performance emotion recognition might enhance human-computer interaction. Deep learning methods have enhanced the performance of emotion recognition systems in comparison to other current methods. In this paper, a multimodal deep convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM) network are proposed, which fuses the audio and visual cues in a deep model. The spatial and temporal features extracted from video frames are fused with short-term Fourier transform (STFT) extracted from audio signals. Finally, a Softmax classifier is used to classify inputs into seven groups: anger, disgust, fear, happiness, sadness, surprise, and neutral mode. The proposed model is evaluated on Surrey Audio-Visual Expressed Emotion (SAVEE) database with an accuracy of 95.48%. 
Our experimental study reveals that the suggested method is more effective than existing algorithms in adapting to emotion recognition in this dataset.\",\"PeriodicalId\":109390,\"journal\":{\"name\":\"2023 6th International Conference on Pattern Recognition and Image Analysis (IPRIA)\",\"volume\":\"24 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-02-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 6th International Conference on Pattern Recognition and Image Analysis (IPRIA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPRIA59240.2023.10147192\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 6th International Conference on Pattern Recognition and Image Analysis (IPRIA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPRIA59240.2023.10147192","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Emotion recognition is a challenging task due to the emotional gap between subjective feelings and low-level audio-visual characteristics. A feasible approach to high-performance emotion recognition could therefore enhance human-computer interaction. Deep learning methods have improved the performance of emotion recognition systems compared with other current methods. This paper proposes a multimodal network combining a deep convolutional neural network (CNN) and a bidirectional long short-term memory (BiLSTM) network, which fuses audio and visual cues in a single deep model. Spatial and temporal features extracted from video frames are fused with short-time Fourier transform (STFT) features extracted from the audio signal. Finally, a Softmax classifier assigns inputs to seven classes: anger, disgust, fear, happiness, sadness, surprise, and neutral. The proposed model is evaluated on the Surrey Audio-Visual Expressed Emotion (SAVEE) database and achieves an accuracy of 95.48%. Our experiments show that the proposed method outperforms existing algorithms for emotion recognition on this dataset.
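The paper's title names K-means clustering, but the abstract does not spell out where it enters the pipeline; a common role in video-based emotion recognition is to cluster the frames of a clip and keep one representative frame per cluster before the spatio-temporal CNN. Below is a minimal sketch of that idea, assuming scikit-learn's KMeans over flattened grayscale pixels; the cluster count and the pixel feature space are illustrative choices, not details from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_keyframes(frames: np.ndarray, k: int = 16) -> np.ndarray:
    """Pick k representative frames by clustering flattened pixels.

    frames: (num_frames, height, width) grayscale array.
    Returns (k, height, width), keyframes in temporal order.
    """
    flat = frames.reshape(frames.shape[0], -1).astype(np.float32)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(flat)
    # For each cluster, keep the member frame closest to the centroid.
    keep = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(flat[members] - km.cluster_centers_[c], axis=1)
        keep.append(members[np.argmin(dists)])
    return frames[np.sort(keep)]
```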
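The audio branch rests on the short-time Fourier transform. Here is a small sketch of extracting a log-magnitude STFT spectrogram with scipy; the sampling rate, window length, and hop size are illustrative defaults, since the abstract does not give the paper's settings.

```python
import numpy as np
from scipy.signal import stft

def log_stft(audio: np.ndarray, sr: int = 16000,
             win_len: int = 512, hop: int = 256) -> np.ndarray:
    """Log-magnitude spectrogram of shape (freq_bins, time_frames)."""
    _, _, Z = stft(audio, fs=sr, window="hann",
                   nperseg=win_len, noverlap=win_len - hop)
    return np.log1p(np.abs(Z))  # log compression stabilizes dynamic range
```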
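The abstract describes fusing spatio-temporal visual features with the STFT audio features in one deep model and classifying the result with Softmax into seven emotions. The PyTorch sketch below shows one plausible arrangement: a per-frame CNN followed by a BiLSTM for the visual stream, a CNN over the spectrogram for the audio stream, and concatenation before the classifier. The layer sizes and the late-fusion design are assumptions, not the paper's verified architecture.

```python
import torch
import torch.nn as nn

class AudioVisualEmotionNet(nn.Module):
    """Illustrative audio-visual fusion model (not the paper's exact design):
    per-frame CNN -> BiLSTM for video, CNN for the STFT spectrogram,
    concatenation, then a 7-way emotion classifier."""

    def __init__(self, num_classes: int = 7):
        super().__init__()
        # Visual branch: small CNN applied to every frame independently.
        self.frame_cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> (batch*time, 32)
        )
        # Temporal modeling over the per-frame embeddings.
        self.bilstm = nn.LSTM(32, 64, batch_first=True, bidirectional=True)
        # Audio branch: small CNN over the log-STFT spectrogram.
        self.audio_cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> (batch, 16)
        )
        self.classifier = nn.Linear(2 * 64 + 16, num_classes)

    def forward(self, frames: torch.Tensor, spec: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 1, H, W); spec: (batch, 1, freq, time)
        b, t = frames.shape[:2]
        v = self.frame_cnn(frames.flatten(0, 1)).view(b, t, -1)
        _, (h, _) = self.bilstm(v)            # h: (2, batch, 64), both directions
        v = torch.cat([h[0], h[1]], dim=1)    # (batch, 128) fused video feature
        a = self.audio_cnn(spec)              # (batch, 16) audio feature
        return self.classifier(torch.cat([v, a], dim=1))  # raw logits
```

During training, `nn.CrossEntropyLoss` applies the softmax implicitly, which matches the abstract's Softmax classifier; at inference, `logits.softmax(dim=1)` yields the seven class probabilities.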