{"title":"Exploring Multimodal Visual Features for Continuous Affect Recognition","authors":"Bo Sun, Siming Cao, Liandong Li, Jun He, Lejun Yu","doi":"10.1145/2988257.2988270","DOIUrl":null,"url":null,"abstract":"This paper presents our work in the Emotion Sub-Challenge of the 6th Audio/Visual Emotion Challenge and Workshop (AVEC 2016), whose goal is to explore utilizing audio, visual and physiological signals to continuously predict the value of the emotion dimensions (arousal and valence). As visual features are very important in emotion recognition, we try a variety of handcrafted and deep visual features. For each video clip, besides the baseline features, we extract multi-scale Dense SIFT features (MSDF), and some types of Convolutional neural networks (CNNs) features to recognize the expression phases of the current frame. We train linear Support Vector Regression (SVR) for every kind of features on the RECOLA dataset. Multimodal fusion of these modalities is then performed with a multiple linear regression model. The final Concordance Correlation Coefficient (CCC) we gained on the development set are 0.824 for arousal, and 0.718 for valence; and on the test set are 0.683 for arousal and 0.642 for valence.","PeriodicalId":432793,"journal":{"name":"Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge","volume":"236 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"28","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2988257.2988270","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
This paper presents our work in the Emotion Sub-Challenge of the 6th Audio/Visual Emotion Challenge and Workshop (AVEC 2016), whose goal is to explore the use of audio, visual, and physiological signals to continuously predict values along the emotion dimensions of arousal and valence. Because visual features are particularly important for emotion recognition, we investigate a variety of handcrafted and deep visual features. For each video clip, in addition to the baseline features, we extract multi-scale Dense SIFT features (MSDF) and several types of Convolutional Neural Network (CNN) features to recognize the expression phase of the current frame. We train a linear Support Vector Regression (SVR) model for each feature type on the RECOLA dataset, and then fuse the modalities with a multiple linear regression model. The final Concordance Correlation Coefficients (CCC) we obtained are 0.824 for arousal and 0.718 for valence on the development set, and 0.683 for arousal and 0.642 for valence on the test set.
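To make the two-stage pipeline concrete, below is a minimal sketch (not the authors' code) of per-modality linear SVR followed by multiple-linear-regression fusion, evaluated with CCC. Feature names, dimensions, and data are placeholders; in practice the fusion weights would be learned from predictions on the development set rather than the training data.

```python
import numpy as np
from sklearn.svm import LinearSVR
from sklearn.linear_model import LinearRegression

def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient (Lin, 1989):
    2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))
    return 2 * cov / (y_true.var() + y_pred.var() + (mean_t - mean_p) ** 2)

# Placeholder per-frame features for each modality (MSDF, CNN, baseline);
# dimensions and data are illustrative, not those used in the paper.
rng = np.random.default_rng(0)
n_frames = 1000
features = {name: rng.normal(size=(n_frames, dim))
            for name, dim in [("msdf", 64), ("cnn", 128), ("baseline", 32)]}
arousal = rng.normal(size=n_frames)  # gold-standard labels (placeholder)

# Stage 1: train one linear SVR per feature type.
preds = []
for name, X in features.items():
    svr = LinearSVR(C=1.0, max_iter=10000).fit(X, arousal)
    preds.append(svr.predict(X))

# Stage 2: fuse the per-modality predictions with multiple linear regression.
P = np.column_stack(preds)
fusion = LinearRegression().fit(P, arousal)
final = fusion.predict(P)

print(f"CCC = {ccc(arousal, final):.3f}")
```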