FA3-Net: feature aggregation and augmentation with attention network for sound event localization and detection
Authors: Chuan Wang, Qinghua Huang
Journal: Applied Intelligence, 55(6) (JCR Q2, Computer Science, Artificial Intelligence; impact factor 3.4)
Published: 2025-03-18
DOI: 10.1007/s10489-025-06437-x
URL: https://link.springer.com/article/10.1007/s10489-025-06437-x
Citations: 0
Abstract
Sound event localization and detection (SELD) combines sound event detection (SED), which identifies the category and duration of sound events, with estimation of each event's direction of arrival (DOA). This multi-task problem is challenging because the features required for the SED and DOA tasks are not entirely aligned, so incomplete feature extraction and suboptimal feature fusion often hinder performance. To address these issues, we propose a feature aggregation and augmentation with attention network (FA3-Net). FA3-Net consists of two main components: the feature aggregation and augmentation with attention (FA3) module and the Conformer module. The FA3 module, designed to handle the distinct requirements of the SED and DOA tasks, fuses and enhances high-level features. It ensures that task-specific features are extracted effectively while improving feature discriminability and reducing confusion. Within the FA3 module, the feature aggregation residual block (FAResBlock) handles task-specific feature aggregation, and the feature augmentation with attention block (FAA block) enhances feature representation across multiple dimensions. The Conformer module models the temporal sequence, since it captures both local and global dependencies. Finally, to overcome data limitations, audio channel swapping (ACS) is employed for data augmentation. Experiments on the STARSS23, DCASE2021, and L3DAS22 datasets show that FA3-Net significantly outperforms other models in both feature aggregation and augmentation, while also being more efficient and lightweight. The code is available at https://github.com/wangchuan11111111/FA3-NET
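The abstract does not say which channel swaps are used, so the following is only a minimal sketch of the general idea behind ACS, not the authors' implementation. It assumes first-order Ambisonics (FOA) audio with channels ordered W, Y, Z, X, and the function names `encode_foa` and `acs_foa`, the three modes, and the angle conventions are all hypothetical. Negating or swapping channels corresponds to a known reflection or rotation of the sound field, so the DOA labels can be transformed to match.

```python
import numpy as np


def encode_foa(signal, azimuth_deg, elevation_deg):
    """Encode a mono signal as FOA (assumed order W, Y, Z, X) for a plane wave."""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    gains = np.array([
        1.0,                       # W: omnidirectional
        np.sin(az) * np.cos(el),   # Y: left-right
        np.sin(el),                # Z: up-down
        np.cos(az) * np.cos(el),   # X: front-back
    ])
    return gains[:, None] * np.asarray(signal)[None, :]


def acs_foa(foa, azimuth_deg, elevation_deg, mode):
    """Apply one illustrative channel swap and return audio plus matching labels.

    foa has shape (4, n_samples); angles are in degrees.
    """
    w, y, z, x = foa
    if mode == "reflect_azimuth":      # azimuth -> -azimuth
        out, az, el = (w, -y, z, x), -azimuth_deg, elevation_deg
    elif mode == "rotate_180":         # azimuth -> azimuth + 180
        out, az, el = (w, -y, z, -x), azimuth_deg + 180.0, elevation_deg
    elif mode == "flip_elevation":     # elevation -> -elevation
        out, az, el = (w, y, -z, x), azimuth_deg, -elevation_deg
    else:
        raise ValueError(f"unknown mode: {mode}")
    az = (az + 180.0) % 360.0 - 180.0  # wrap azimuth to [-180, 180)
    return np.stack(out), az, el


if __name__ == "__main__":
    source = np.linspace(-1.0, 1.0, 8)
    clip = encode_foa(source, 30.0, 20.0)
    augmented, new_az, new_el = acs_foa(clip, 30.0, 20.0, "rotate_180")
    print(new_az, new_el)
```

Each transformed clip is a new, physically consistent training example, which is why this kind of augmentation can help when labeled spatial recordings are scarce.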
About the journal:
With a focus on research in artificial intelligence and neural networks, this journal addresses issues involving solutions of real-life manufacturing, defense, management, government and industrial problems which are too complex to be solved through conventional approaches and require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance.
The journal presents new and original research and technological developments, addressing real and complex issues applicable to difficult problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.