Multimodal Group Emotion Recognition In-the-wild Using Privacy-Compliant Features

Anderson Augusma, Dominique Vaufreydaz, Frédérique Letué
{"title":"Multimodal Group Emotion Recognition In-the-wild Using Privacy-Compliant Features","authors":"Anderson Augusma, Dominique Vaufreydaz, Frédérique Letué","doi":"10.1145/3577190.3616546","DOIUrl":null,"url":null,"abstract":"This paper explores privacy-compliant group-level emotion recognition \"in-the-wild\" within the EmotiW Challenge 2023. Group-level emotion recognition can be useful in many fields including social robotics, conversational agents, e-coaching and learning analytics. This research imposes itself using only global features avoiding individual ones, i.e. all features that can be used to identify or track people in videos (facial landmarks, body poses, audio diarization, etc.). The proposed multimodal model is composed of a video and an audio branches with a cross-attention between modalities. The video branch is based on a fine-tuned ViT architecture. The audio branch extracts Mel-spectrograms and feed them through CNN blocks into a transformer encoder. Our training paradigm includes a generated synthetic dataset to increase the sensitivity of our model on facial expression within the image in a data-driven way. The extensive experiments show the significance of our methodology. Our privacy-compliant proposal performs fairly on the EmotiW challenge, with 79.24% and 75.13% of accuracy respectively on validation and test set for the best models. Noticeably, our findings highlight that it is possible to reach this accuracy level with privacy-compliant features using only 5 frames uniformly distributed on the video.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"118 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Companion Publication of the 2020 International Conference on Multimodal Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3577190.3616546","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

This paper explores privacy-compliant group-level emotion recognition "in-the-wild" within the EmotiW Challenge 2023. Group-level emotion recognition can be useful in many fields, including social robotics, conversational agents, e-coaching, and learning analytics. This research restricts itself to global features and avoids individual ones, i.e., any feature that could be used to identify or track people in videos (facial landmarks, body poses, audio diarization, etc.). The proposed multimodal model is composed of a video branch and an audio branch with cross-attention between the modalities. The video branch is based on a fine-tuned ViT architecture. The audio branch extracts Mel-spectrograms and feeds them through CNN blocks into a transformer encoder. Our training paradigm includes a generated synthetic dataset to increase the model's sensitivity to facial expressions within the image in a data-driven way. Extensive experiments show the significance of our methodology. Our privacy-compliant proposal performs fairly on the EmotiW challenge, with accuracies of 79.24% and 75.13% on the validation and test sets, respectively, for the best models. Notably, our findings highlight that it is possible to reach this accuracy level with privacy-compliant features using only 5 frames uniformly distributed over the video.
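As a concrete illustration of the pipeline the abstract describes (uniform 5-frame sampling, a Mel-spectrogram → CNN → transformer audio branch, and cross-attention fusion with a ViT video branch), here is a minimal PyTorch sketch. It is not the authors' implementation: the module sizes, the 3-class head, the helper names (`sample_frames`, `AudioBranch`, `GroupEmotionModel`), and the choice of video tokens as attention queries are all illustrative assumptions.

```python
# Minimal sketch of the two-branch architecture described in the abstract.
# All dimensions, layer counts, and names are illustrative assumptions.
import torch
import torch.nn as nn
import torchaudio


NUM_CLASSES = 3  # assumed group-emotion labels (e.g. negative/neutral/positive)


def sample_frames(video: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Pick k frames uniformly distributed over a clip of shape (T, C, H, W)."""
    t = video.shape[0]
    idx = torch.linspace(0, t - 1, k).long()
    return video[idx]


class AudioBranch(nn.Module):
    """Mel-spectrogram -> CNN blocks -> transformer encoder, per the abstract."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_mels=64)
        self.cnn = nn.Sequential(  # hypothetical CNN blocks
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        x = self.melspec(wav).unsqueeze(1)   # (B, 1, n_mels, frames)
        x = self.cnn(x)                      # (B, d_model, h, w)
        x = x.flatten(2).transpose(1, 2)     # (B, h*w, d_model) token sequence
        return self.encoder(x)


class GroupEmotionModel(nn.Module):
    """Video (ViT) and audio branches fused by cross-attention."""

    def __init__(self, video_backbone: nn.Module, d_model: int = 256):
        super().__init__()
        # Assumed: a fine-tuned ViT emitting (B, N, d_model) patch tokens.
        self.video_backbone = video_backbone
        self.audio = AudioBranch(d_model)
        self.cross_attn = nn.MultiheadAttention(
            d_model, num_heads=4, batch_first=True)
        self.head = nn.Linear(d_model, NUM_CLASSES)

    def forward(self, frames: torch.Tensor, wav: torch.Tensor) -> torch.Tensor:
        v = self.video_backbone(frames)      # video tokens used as queries
        a = self.audio(wav)                  # audio tokens as keys/values
        fused, _ = self.cross_attn(v, a, a)
        return self.head(fused.mean(dim=1))  # pool tokens, classify the group
```

Note that the abstract does not state the direction of the cross-attention; the sketch lets video tokens attend over audio tokens, but the reverse or a symmetric arrangement would be equally consistent with the description.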