Audio-Visual Group-based Emotion Recognition using Local and Global Feature Aggregation based Multi-Task Learning

Sunan Li, Hailun Lian, Cheng Lu, Yan Zhao, Chuangao Tang, Yuan Zong, Wenming Zheng
{"title":"Audio-Visual Group-based Emotion Recognition using Local and Global Feature Aggregation based Multi-Task Learning","authors":"Sunan Li, Hailun Lian, Cheng Lu, Yan Zhao, Chuangao Tang, Yuan Zong, Wenming Zheng","doi":"10.1145/3577190.3616544","DOIUrl":null,"url":null,"abstract":"Audio-video group emotion recognition is a challenging task and has attracted more attention in recent decades. Recently, deep learning models have shown tremendous advances in analyzing human emotion. However, due to its difficulties such as hard to gather a broad range of potential information to obtain meaningful emotional representations and hard to associate implicit contextual knowledge like humans. To tackle these problems, in this paper, we proposed the Local and Global Feature Aggregation based Multi-Task Learning (LGFAM) method to tackle the Group Emotion Recognition problem. The framework consists of three parallel feature extraction networks that were verified in previous work. After that, an attention network using MLP as a backbone with specially designed loss functions was used to fuse features from different modalities. In the experiment section, we present its performance on the EmotiW2023 Audio-Visual Group-based Emotion Recognition subchallenge which aims to classify a video into one of the three emotions. According to the feedback results, the best result achieved 70.63 WAR and 70.38 UAR on the test set. Such improvement proves the effectiveness of our method.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"35 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Companion Publication of the 2020 International Conference on Multimodal Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3577190.3616544","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Audio-visual group emotion recognition is a challenging task that has attracted increasing attention in recent decades. Deep learning models have recently shown tremendous advances in analyzing human emotion. However, the task remains difficult: it is hard to gather a broad range of potentially relevant information to obtain meaningful emotional representations, and hard to associate implicit contextual knowledge the way humans do. To tackle these problems, in this paper we propose the Local and Global Feature Aggregation based Multi-Task Learning (LGFAM) method for the group emotion recognition problem. The framework consists of three parallel feature extraction networks that were verified in previous work. An attention network with an MLP backbone and specially designed loss functions is then used to fuse the features from the different modalities. In the experiment section, we report its performance on the EmotiW2023 Audio-Visual Group-based Emotion Recognition sub-challenge, which aims to classify a video into one of three emotions. According to the feedback results, our best submission achieved 70.63 WAR and 70.38 UAR on the test set, demonstrating the effectiveness of our method.
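The abstract describes fusing per-modality features with an attention network built on an MLP backbone. Below is a minimal sketch of that general idea, not the authors' implementation: the class name, feature dimension, layer sizes, and the three-class output are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): MLP-based attention fusion
# of features from three modalities, followed by a 3-class classifier.
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    def __init__(self, feat_dim=512, num_classes=3):
        super().__init__()
        # Small MLP that assigns a scalar relevance score to each modality's feature vector
        self.score_mlp = nn.Sequential(
            nn.Linear(feat_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, feats):
        # feats: (batch, num_modalities, feat_dim), one row per modality
        scores = self.score_mlp(feats).squeeze(-1)            # (batch, num_modalities)
        weights = torch.softmax(scores, dim=-1)               # attention weights over modalities
        fused = (weights.unsqueeze(-1) * feats).sum(dim=1)    # weighted sum -> (batch, feat_dim)
        return self.classifier(fused), weights


# Example: fuse three modality feature vectors for a batch of 4 clips
model = AttentionFusion()
feats = torch.randn(4, 3, 512)
logits, weights = model(feats)
```

In this sketch the attention weights are shared across the feature dimension; the paper's specially designed multi-task losses are not reproduced here.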