Deep learning model for simultaneous recognition of quantitative and qualitative emotion using visual and bio-sensing data

IF 4.3 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Computer Vision and Image Understanding, Vol. 248, Article 104121 · Pub Date: 2024-08-22 · DOI: 10.1016/j.cviu.2024.104121
Iman Hosseini, Md Zakir Hossain, Yuhao Zhang, Shafin Rahman
{"title":"Deep learning model for simultaneous recognition of quantitative and qualitative emotion using visual and bio-sensing data","authors":"Iman Hosseini ,&nbsp;Md Zakir Hossain ,&nbsp;Yuhao Zhang ,&nbsp;Shafin Rahman","doi":"10.1016/j.cviu.2024.104121","DOIUrl":null,"url":null,"abstract":"<div><p>The recognition of emotions heavily relies on important factors such as human facial expressions and physiological signals, including electroencephalogram and electrocardiogram. In literature, emotion recognition is investigated quantitatively (while estimating valance, arousal, and dominance) and qualitatively (while predicting discrete emotions like happiness, sadness, anger, surprise, and so on). Current methods utilize a combination of visual data and bio-sensing information to create recognition systems that incorporate multiple modes (quantitative/qualitative). Nevertheless, these methods necessitate extensive expertise in specific domains and intricate preprocessing procedures, and consequently, they are unable to fully leverage the inherent advantages of end-to-end deep learning techniques. Moreover, methods usually aim to recognize either qualitative or quantitative emotions. Although both kinds of emotions are significantly co-related, previous methods do not simultaneously recognize qualitative and quantitative emotions. In this paper, a novel deep end-to-end framework named DeepVADNet is introduced, specifically designed for the purpose of multi-modal emotion recognition. The proposed framework leverages deep learning techniques to effectively extract crucial face appearance features as well as bio-sensing features, predicting both qualitative and quantitative emotions in a single forward pass. In this study, we employ the CRNN architecture to extract face appearance features, while the ConvLSTM model is utilized to extract spatio-temporal information from visual data (videos). Additionally, we utilize the Conv1D model for processing physiological signals (EEG, EOG, ECG, and GSR) as this approach deviates from conventional manual techniques that involve traditional manual methods for extracting features based on time and frequency domains. After enhancing the feature quality by fusing both modalities, we use a novel method employing quantitative emotion to predict qualitative emotions accurately. We perform extensive experiments on the DEAP and MAHNOB-HCI datasets, achieving state-of-the-art quantitative emotion recognition results of 98.93%/6e-4 and 89.08%/0.97 (mean classification accuracy/MSE) in both datasets, respectively. Also, for the qualitative emotion recognition task, we achieve 82.71% mean classification accuracy on the MAHNOB-HCI dataset. 
The code and evaluation can be accessed at: <span><span>https://github.com/I-Man-H/DeepVADNet.git</span><svg><path></path></svg></span></p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104121"},"PeriodicalIF":4.3000,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314224002029","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

The recognition of emotions relies heavily on factors such as human facial expressions and physiological signals, including the electroencephalogram and electrocardiogram. In the literature, emotion recognition is investigated quantitatively (by estimating valence, arousal, and dominance) and qualitatively (by predicting discrete emotions such as happiness, sadness, anger, and surprise). Current methods combine visual data and bio-sensing information to create recognition systems that incorporate multiple modes (quantitative/qualitative). Nevertheless, these methods require extensive domain expertise and intricate preprocessing procedures, and consequently they cannot fully leverage the inherent advantages of end-to-end deep learning. Moreover, existing methods usually aim to recognize either qualitative or quantitative emotions. Although the two kinds of emotion are strongly correlated, previous methods do not recognize them simultaneously. In this paper, a novel end-to-end deep framework named DeepVADNet is introduced, specifically designed for multi-modal emotion recognition. The proposed framework leverages deep learning to extract crucial face appearance features as well as bio-sensing features, predicting both qualitative and quantitative emotions in a single forward pass. In this study, we employ a CRNN architecture to extract face appearance features, while a ConvLSTM model extracts spatio-temporal information from visual data (videos). Additionally, we use a Conv1D model to process the physiological signals (EEG, EOG, ECG, and GSR), departing from conventional manual feature extraction based on the time and frequency domains. After enhancing feature quality by fusing both modalities, we use a novel method that employs the quantitative emotion estimate to predict qualitative emotions accurately. We perform extensive experiments on the DEAP and MAHNOB-HCI datasets, achieving state-of-the-art quantitative emotion recognition results of 98.93%/6e-4 and 89.08%/0.97 (mean classification accuracy/MSE) on the two datasets, respectively. For the qualitative emotion recognition task, we achieve 82.71% mean classification accuracy on the MAHNOB-HCI dataset. The code and evaluation can be accessed at: https://github.com/I-Man-H/DeepVADNet.git
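To make the pipeline described above concrete, the following is a minimal PyTorch sketch of a two-branch model of this kind: a CRNN-style visual branch, a Conv1D branch over raw physiological channels, feature fusion, a valence/arousal/dominance regression head, and a discrete-emotion classifier conditioned on that quantitative output. All layer sizes, channel counts, and the fusion/conditioning details are illustrative assumptions, not the authors' exact DeepVADNet configuration (the released implementation is at the GitHub link above); the ConvLSTM component is replaced here by a plain CNN+LSTM stand-in to keep the sketch short.

```python
# Illustrative sketch only: a simplified two-branch, two-head emotion model.
import torch
import torch.nn as nn


class VisualBranch(nn.Module):
    """Per-frame CNN encoder followed by an LSTM over time (CRNN-style)."""

    def __init__(self, feat_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.rnn = nn.LSTM(32, feat_dim, batch_first=True)

    def forward(self, frames):                     # (B, T, 3, H, W)
        b, t = frames.shape[:2]
        f = self.cnn(frames.flatten(0, 1))         # (B*T, 32)
        _, (h, _) = self.rnn(f.view(b, t, -1))     # last hidden state
        return h[-1]                               # (B, feat_dim)


class PhysioBranch(nn.Module):
    """1D convolutions over raw multi-channel signals (EEG/EOG/ECG/GSR),
    replacing hand-crafted time/frequency-domain features."""

    def __init__(self, in_channels=40, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv1d(64, feat_dim, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )

    def forward(self, signals):                    # (B, C, L)
        return self.net(signals)                   # (B, feat_dim)


class EmotionNet(nn.Module):
    """Fuses both modalities, regresses valence/arousal/dominance, and feeds
    that quantitative prediction into the discrete-emotion classifier."""

    def __init__(self, n_classes=9, feat_dim=128):
        super().__init__()
        self.visual = VisualBranch(feat_dim)
        self.physio = PhysioBranch(feat_dim=feat_dim)
        self.vad_head = nn.Linear(2 * feat_dim, 3)               # V, A, D
        self.cls_head = nn.Linear(2 * feat_dim + 3, n_classes)   # also sees VAD

    def forward(self, frames, signals):
        fused = torch.cat([self.visual(frames), self.physio(signals)], dim=1)
        vad = self.vad_head(fused)                                # quantitative
        logits = self.cls_head(torch.cat([fused, vad], dim=1))    # qualitative
        return vad, logits


# Single forward pass yields both outputs (input shapes are placeholders).
model = EmotionNet()
vad, logits = model(torch.randn(2, 8, 3, 64, 64), torch.randn(2, 40, 256))
print(vad.shape, logits.shape)   # torch.Size([2, 3]) torch.Size([2, 9])
```

The key structural point the sketch illustrates is the "simultaneous" aspect of the abstract: both the VAD regression and the discrete-emotion logits come out of one forward pass, with the classifier taking the VAD estimate as an extra input so the quantitative prediction can inform the qualitative one.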

Source journal
Computer Vision and Image Understanding
Category: Engineering & Technology – Engineering: Electrical & Electronic
CiteScore: 7.80
Self-citation rate: 4.40%
Articles per year: 112
Review time: 79 days
Journal description: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views. Research areas include: • Theory • Early vision • Data structures and representations • Shape • Range • Motion • Matching and recognition • Architecture and languages • Vision systems
Latest articles in this journal
- Scene-cGAN: A GAN for underwater restoration and scene depth estimation
- 2S-SGCN: A two-stage stratified graph convolutional network model for facial landmark detection on 3D data
- Dual stage semantic information based generative adversarial network for image super-resolution
- Enhancing scene text detectors with realistic text image synthesis using diffusion models
- Unsupervised co-generation of foreground–background segmentation from Text-to-Image synthesis