Group-level Speech Emotion Recognition Utilising Deep Spectrum Features

Sandra Ottl, S. Amiriparian, Maurice Gerczuk, Vincent Karas, Björn Schuller
{"title":"Group-level Speech Emotion Recognition Utilising Deep Spectrum Features","authors":"Sandra Ottl, S. Amiriparian, Maurice Gerczuk, Vincent Karas, Björn Schuller","doi":"10.1145/3382507.3417964","DOIUrl":null,"url":null,"abstract":"The objectives of this challenge paper are two fold: first, we apply a range of neural network based transfer learning approaches to cope with the data scarcity in the field of speech emotion recognition, and second, we fuse the obtained representations and predictions in a nearly and late fusion strategy to check the complementarity of the applied networks. In particular, we use our Deep Spectrum system to extract deep feature representations from the audio content of the 2020 EmotiW group level emotion prediction challenge data. We evaluate a total of ten ImageNet pre-trained Convolutional Neural Networks, including AlexNet, VGG16, VGG19 and three DenseNet variants as audio feature extractors. We compare their performance to the ComParE feature set used in the challenge baseline, employing simple logistic regression models trained with Stochastic Gradient Descent as classifiers. With the help of late fusion, our approach improves the performance on the test set from 47.88 % to 62.70 % accuracy.","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"41 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"22","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2020 International Conference on Multimodal Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3382507.3417964","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 22

Abstract

The objectives of this challenge paper are twofold: first, we apply a range of neural network-based transfer learning approaches to cope with the data scarcity in the field of speech emotion recognition, and second, we fuse the obtained representations and predictions in an early and late fusion strategy to check the complementarity of the applied networks. In particular, we use our Deep Spectrum system to extract deep feature representations from the audio content of the 2020 EmotiW group-level emotion prediction challenge data. We evaluate a total of ten ImageNet pre-trained Convolutional Neural Networks, including AlexNet, VGG16, VGG19 and three DenseNet variants, as audio feature extractors. We compare their performance to the ComParE feature set used in the challenge baseline, employing simple logistic regression models trained with Stochastic Gradient Descent as classifiers. With the help of late fusion, our approach improves the accuracy on the test set from 47.88 % to 62.70 %.
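To make the pipeline described in the abstract concrete, the sketch below outlines one way such a system could be assembled in Python: audio is rendered as a mel-spectrogram image, passed through an ImageNet pre-trained CNN (here DenseNet121 via torchvision) acting as a fixed feature extractor, and the resulting features are fed to a logistic regression model trained with Stochastic Gradient Descent; class probabilities from several such models can then be averaged for late fusion. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the plain grayscale spectrogram-to-image mapping, and the probability-averaging fusion rule are assumptions (the published Deep Spectrum toolkit, for instance, renders spectrogram plots with a colour map before feeding them to the CNN).

```python
# Minimal Deep Spectrum-style sketch, assuming recent librosa, torchvision and
# scikit-learn. Names and parameters are illustrative, not the authors' code.
import numpy as np
import librosa
import torch
from torchvision import models, transforms
from sklearn.linear_model import SGDClassifier

# ImageNet pre-trained CNN used as a fixed audio feature extractor.
cnn = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
cnn.classifier = torch.nn.Identity()  # keep the 1024-d penultimate-layer features
cnn.eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def deep_spectrum_features(wav_path: str) -> np.ndarray:
    """Turn one audio file into a mel-spectrogram image and extract CNN features."""
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    # Scale to [0, 1] and replicate to three channels so the ImageNet CNN accepts
    # it (a colour map would normally be applied here; replication keeps it short).
    img = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
    img = np.stack([img] * 3, axis=-1).astype(np.float32)
    with torch.no_grad():
        feats = cnn(preprocess(img).unsqueeze(0))
    return feats.squeeze(0).numpy()

# Simple logistic regression trained with SGD, as used for both the Deep Spectrum
# and the ComParE baseline features in the paper.
clf = SGDClassifier(loss="log_loss", max_iter=1000)

def late_fusion(probability_list: list) -> np.ndarray:
    """Average per-class probabilities of independently trained models (late fusion)."""
    return np.mean(probability_list, axis=0).argmax(axis=1)

# Hypothetical usage (paths and labels are placeholders):
# X_train = np.stack([deep_spectrum_features(p) for p in train_paths])
# clf.fit(X_train, train_labels)
# fused = late_fusion([clf.predict_proba(X_test), other_clf.predict_proba(X_test_compare)])
```

The design choice mirrored here is that the CNNs are never fine-tuned on audio: they only supply fixed representations, and all task-specific learning happens in the lightweight linear classifiers, which keeps the approach feasible despite the small amount of labelled emotion data.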