{"title":"Adaptive Alignment and Time Aggregation Network for Speech-Visual Emotion Recognition","authors":"Lile Wu;Lei Bai;Wenhao Cheng;Zutian Cheng;Guanghui Chen","doi":"10.1109/LSP.2025.3550007","DOIUrl":null,"url":null,"abstract":"Video-based speech-visual emotion recognition plays a crucial role in human-computer interaction applications. However, it faces several challenges, including: 1) the redundancy in the extracted speech-visual features caused by the heterogeneity between speech and visual modalities, and 2) the ineffective modeling of the time-varying characteristics of emotions. To this end, this paper proposes an adaptive alignment and time aggregation network (AataNet). Specifically, AataNet designs a low redundancy speech-visual adaptive alignment (LRSVAA) module to acquire the low-redundant aligned features of speech-visual modalities. Meanwhile, AataNet also designs a computationally efficient time-adaptive aggregation (CETAA) module to model the time-varying characteristics of emotions. Experiments on RAVDESS, BAUM-1 s and eNTERFACE05 datasets also demonstrate that the proposed AataNet achieves better results.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"1181-1185"},"PeriodicalIF":3.2000,"publicationDate":"2025-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Signal Processing Letters","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10919089/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citation count: 0
Abstract
Video-based speech-visual emotion recognition plays a crucial role in human-computer interaction applications. However, it faces several challenges, including: 1) redundancy in the extracted speech-visual features caused by the heterogeneity between the speech and visual modalities, and 2) ineffective modeling of the time-varying characteristics of emotions. To address these issues, this paper proposes an adaptive alignment and time aggregation network (AataNet). Specifically, AataNet introduces a low-redundancy speech-visual adaptive alignment (LRSVAA) module to obtain low-redundancy aligned features across the speech and visual modalities. In addition, AataNet employs a computationally efficient time-adaptive aggregation (CETAA) module to model the time-varying characteristics of emotions. Experiments on the RAVDESS, BAUM-1s, and eNTERFACE05 datasets demonstrate that the proposed AataNet achieves superior results.
Journal description:
The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language, and audio processing. Papers published in the Letters can be presented within one year of their appearance at signal processing conferences such as ICASSP, GlobalSIP, and ICIP, as well as at several workshops organized by the Signal Processing Society.