Multimodal Group Emotion Recognition In-the-wild Using Privacy-Compliant Features

Companion Publication of the 2020 International Conference on Multimodal Interaction. Pub Date: 2023-10-09. DOI: 10.1145/3577190.3616546

Anderson Augusma, Dominique Vaufreydaz, Frédérique Letué


Abstract

This paper explores privacy-compliant group-level emotion recognition "in-the-wild" within the EmotiW Challenge 2023. Group-level emotion recognition can be useful in many fields, including social robotics, conversational agents, e-coaching, and learning analytics. This research restricts itself to global features and avoids individual ones, i.e. all features that can be used to identify or track people in videos (facial landmarks, body poses, audio diarization, etc.). The proposed multimodal model is composed of a video branch and an audio branch, with cross-attention between modalities. The video branch is based on a fine-tuned ViT architecture. The audio branch extracts Mel-spectrograms and feeds them through CNN blocks into a transformer encoder. Our training paradigm includes a generated synthetic dataset to increase the sensitivity of our model to facial expressions within the image in a data-driven way. Extensive experiments show the significance of our methodology. Our privacy-compliant proposal performs fairly well on the EmotiW challenge, with accuracies of 79.24% and 75.13% on the validation and test sets, respectively, for the best models. Notably, our findings highlight that it is possible to reach this accuracy level with privacy-compliant features using only 5 frames uniformly distributed over the video.
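The abstract mentions using only 5 frames uniformly distributed over the video. As an illustration only, the sketch below shows one plausible way to pick such frame indices: split the video into equal segments and take the middle frame of each. The exact sampling rule used by the authors is not given in the abstract, so the segment-center choice and the function name `uniform_frame_indices` are assumptions made for this example.

```python
def uniform_frame_indices(num_frames: int, k: int = 5) -> list[int]:
    """Return k frame indices spread uniformly over a video.

    Assumption (not from the paper): the video is split into k equal
    segments and the center frame of each segment is selected.
    """
    if num_frames <= 0 or k <= 0:
        raise ValueError("num_frames and k must be positive")
    # Integer form of floor((i + 0.5) * num_frames / k), which avoids
    # floating-point rounding issues.
    return [((2 * i + 1) * num_frames) // (2 * k) for i in range(k)]


if __name__ == "__main__":
    # A hypothetical 100-frame clip.
    print(uniform_frame_indices(100, 5))
```

For clips shorter than `k` frames this rule repeats some indices rather than failing, which keeps the input size to the video branch fixed; whether that matches the original pipeline is not stated in the source.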


