An Active Learning Paradigm for Online Audio-Visual Emotion Recognition

IF 9.8 · Zone 2, Computer Science · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · IEEE Transactions on Affective Computing · Pub Date: 2019-12-20 · DOI: 10.1109/TAFFC.2019.2961089

Ioannis Kansizoglou;Loukas Bampis;Antonios Gasteratos

IEEE Transactions on Affective Computing, vol. 13, no. 2, pp. 756-768. Publication type: Journal Article. Full text: https://ieeexplore.ieee.org/document/8937495/

Citations: 54

Abstract

Advances in Human-Robot Interaction (HRI) drive research into emotion recognition architectures that capture the audio-visual (A-V) modalities of human emotion. State-of-the-art methods in multi-modal emotion recognition mainly classify complete video sequences, which leaves them with no online capability: they can predict an emotion only once the video has ended, restricting their applicability in practical scenarios. This article presents a novel paradigm for online emotion classification that exploits both audio and visual modalities and produces a responsive prediction as soon as the system is confident enough. We propose two deep Convolutional Neural Network (CNN) models for extracting emotion features, one for each modality, and a Deep Neural Network (DNN) for their fusion. To capture the temporal character of human emotion in interactive scenarios, we train in cascade a Long Short-Term Memory (LSTM) layer and a Reinforcement Learning (RL) agent that monitors the speaker and decides when to stop feature extraction and make the final prediction. A comparison of our results on two publicly available A-V emotional datasets, RML and BAUM-1s, against other state-of-the-art models demonstrates the benefits of our approach.
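
The online behaviour described in the abstract (keep consuming the audio-visual stream and commit to a prediction as soon as the system is confident enough) can be illustrated with a minimal Python sketch. It is not the authors' method: a running mean of per-step class probabilities stands in for the LSTM, and a fixed confidence threshold stands in for the learned RL stopping agent. The names `running_mean`, `predict_online`, and `threshold`, and the example probabilities, are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Optional, Sequence, Tuple


def running_mean(prev: Sequence[float], new: Sequence[float], n: int) -> List[float]:
    """Update a running mean of class probabilities after the n-th step (n >= 1)."""
    return [p + (x - p) / n for p, x in zip(prev, new)]


def predict_online(
    step_probs: Iterable[Sequence[float]],
    threshold: float = 0.8,
) -> Tuple[Optional[int], int]:
    """Consume per-step class probabilities and stop early when confident.

    ASSUMPTION: the running mean replaces the paper's LSTM, and the fixed
    threshold replaces the paper's RL agent; both are simplifications.

    Returns (predicted_class, steps_consumed). If the stream ends before the
    threshold is reached, the best guess so far is returned. An empty stream
    yields (None, 0).
    """
    mean: List[float] = []
    steps = 0
    for probs in step_probs:
        steps += 1
        # Aggregate the evidence seen so far.
        mean = list(probs) if steps == 1 else running_mean(mean, probs, steps)
        best = max(range(len(mean)), key=mean.__getitem__)
        # Stop consuming the stream once the aggregated confidence is high enough.
        if mean[best] >= threshold:
            return best, steps
    if not mean:
        return None, 0
    return max(range(len(mean)), key=mean.__getitem__), steps


# Hypothetical fused (audio + visual) class probabilities for four time steps.
stream = [
    [0.40, 0.30, 0.30],
    [0.70, 0.20, 0.10],
    [0.95, 0.03, 0.02],
    [0.90, 0.05, 0.05],
]
label, used = predict_online(stream, threshold=0.65)
print(label, used)
```

With these made-up inputs the loop stops after the third step and never reads the fourth, which is the property that separates an online classifier from one that must see the complete video. A learned stopping policy, as in the paper, would replace the `mean[best] >= threshold` test.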

Source journal

IEEE Transactions on Affective Computing (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; COMPUTER SCIENCE, CYBERNETICS)

CiteScore: 15.00

Self-citation rate: 6.20%

Articles published: 174

About the journal: The IEEE Transactions on Affective Computing is an international and interdisciplinary journal. Its primary goal is to share research findings on the development of systems capable of recognizing, interpreting, and simulating human emotions and related affective phenomena. The journal publishes original research on the underlying principles and theories that explain how and why affective factors shape human-technology interactions. It also focuses on how techniques for sensing and simulating affect can enhance our understanding of human emotions and processes. Additionally, the journal explores the design, implementation, and evaluation of systems that prioritize the consideration of affect in their usability. We also welcome surveys of existing work that provide new perspectives on the historical and future directions of this field.