{"title":"为改进声音事件定位和检测而进行时空-频率联合表征学习","authors":"Baoqing Chen, Mei Wang, Yu Gu","doi":"10.3390/s24186090","DOIUrl":null,"url":null,"abstract":"<p><p>Sound event localization and detection (SELD) is a crucial component of machine listening that aims to simultaneously identify and localize sound events in multichannel audio recordings. This task demands an integrated analysis of spatial, temporal, and frequency domains to accurately characterize sound events. The spatial domain pertains to the varying acoustic signals captured by multichannel microphones, which are essential for determining the location of sound sources. However, the majority of recent studies have focused on time-frequency correlations and spatio-temporal correlations separately, leading to inadequate performance in real-life scenarios. In this paper, we propose a novel SELD method that utilizes the newly developed Spatio-Temporal-Frequency Fusion Network (STFF-Net) to jointly learn comprehensive features across spatial, temporal, and frequency domains of sound events. The backbone of our STFF-Net is the Enhanced-3D (E3D) residual block, which combines 3D convolutions with a parameter-free attention mechanism to capture and refine the intricate correlations among these domains. Furthermore, our method incorporates the multi-ACCDOA format to effectively handle homogeneous overlaps between sound events. 
During the evaluation, we conduct extensive experiments on three de facto benchmark datasets, and our results demonstrate that the proposed SELD method significantly outperforms current state-of-the-art approaches.</p>","PeriodicalId":21698,"journal":{"name":"Sensors","volume":null,"pages":null},"PeriodicalIF":3.4000,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11436190/pdf/","citationCount":"0","resultStr":"{\"title\":\"Joint Spatio-Temporal-Frequency Representation Learning for Improved Sound Event Localization and Detection.\",\"authors\":\"Baoqing Chen, Mei Wang, Yu Gu\",\"doi\":\"10.3390/s24186090\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Sound event localization and detection (SELD) is a crucial component of machine listening that aims to simultaneously identify and localize sound events in multichannel audio recordings. This task demands an integrated analysis of spatial, temporal, and frequency domains to accurately characterize sound events. The spatial domain pertains to the varying acoustic signals captured by multichannel microphones, which are essential for determining the location of sound sources. However, the majority of recent studies have focused on time-frequency correlations and spatio-temporal correlations separately, leading to inadequate performance in real-life scenarios. In this paper, we propose a novel SELD method that utilizes the newly developed Spatio-Temporal-Frequency Fusion Network (STFF-Net) to jointly learn comprehensive features across spatial, temporal, and frequency domains of sound events. The backbone of our STFF-Net is the Enhanced-3D (E3D) residual block, which combines 3D convolutions with a parameter-free attention mechanism to capture and refine the intricate correlations among these domains. 
Furthermore, our method incorporates the multi-ACCDOA format to effectively handle homogeneous overlaps between sound events. During the evaluation, we conduct extensive experiments on three de facto benchmark datasets, and our results demonstrate that the proposed SELD method significantly outperforms current state-of-the-art approaches.</p>\",\"PeriodicalId\":21698,\"journal\":{\"name\":\"Sensors\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.4000,\"publicationDate\":\"2024-09-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11436190/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Sensors\",\"FirstCategoryId\":\"103\",\"ListUrlMain\":\"https://doi.org/10.3390/s24186090\",\"RegionNum\":3,\"RegionCategory\":\"综合性期刊\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"CHEMISTRY, ANALYTICAL\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Sensors","FirstCategoryId":"103","ListUrlMain":"https://doi.org/10.3390/s24186090","RegionNum":3,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"CHEMISTRY, ANALYTICAL","Score":null,"Total":0}
Joint Spatio-Temporal-Frequency Representation Learning for Improved Sound Event Localization and Detection.
Sound event localization and detection (SELD) is a crucial component of machine listening that aims to simultaneously identify and localize sound events in multichannel audio recordings. This task demands an integrated analysis of spatial, temporal, and frequency domains to accurately characterize sound events. The spatial domain pertains to the varying acoustic signals captured by multichannel microphones, which are essential for determining the location of sound sources. However, the majority of recent studies have focused on time-frequency correlations and spatio-temporal correlations separately, leading to inadequate performance in real-life scenarios. In this paper, we propose a novel SELD method that utilizes the newly developed Spatio-Temporal-Frequency Fusion Network (STFF-Net) to jointly learn comprehensive features across spatial, temporal, and frequency domains of sound events. The backbone of our STFF-Net is the Enhanced-3D (E3D) residual block, which combines 3D convolutions with a parameter-free attention mechanism to capture and refine the intricate correlations among these domains. Furthermore, our method incorporates the multi-ACCDOA format to effectively handle homogeneous overlaps between sound events. During the evaluation, we conduct extensive experiments on three de facto benchmark datasets, and our results demonstrate that the proposed SELD method significantly outperforms current state-of-the-art approaches.
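The abstract describes the E3D residual block as combining 3D convolutions with a parameter-free attention mechanism that refines correlations across the spatial, temporal, and frequency domains. The paper's exact attention formulation is not given here; the sketch below uses a SimAM-style energy-based gating (a well-known parameter-free attention) over a 5D feature map purely as an illustrative stand-in. The tensor layout `(batch, channels, T, F, S)` and the function name are assumptions, not the authors' implementation.

```python
import numpy as np

def parameter_free_attention(x, lam=1e-4):
    """SimAM-style parameter-free attention over a 5D feature map.

    x   : array of shape (batch, channels, T, F, S) -- time, frequency,
          and a spatial (channel-interaction) axis. Illustrative only.
    lam : small regularizer stabilizing the energy denominator.

    Each neuron is weighted by a sigmoid of its "inverse energy": how far
    it deviates from the per-channel mean, normalized by the per-channel
    variance. No learnable parameters are introduced.
    """
    axes = (2, 3, 4)  # attend jointly over the T, F, S dimensions
    n = np.prod([x.shape[a] for a in axes]) - 1  # unbiased-variance count
    mu = x.mean(axis=axes, keepdims=True)
    d = (x - mu) ** 2
    var = d.sum(axis=axes, keepdims=True) / n
    e_inv = d / (4.0 * (var + lam)) + 0.5        # inverse energy per neuron
    return x * (1.0 / (1.0 + np.exp(-e_inv)))    # sigmoid gating, same shape

# Usage on a dummy feature map (2 clips, 8 channels, 10x16x4 grid):
feat = np.random.randn(2, 8, 10, 16, 4)
refined = parameter_free_attention(feat)
```

Because the gate is a sigmoid in (0, 1), the output keeps the input's shape and sign while attenuating low-energy (near-mean) activations, which matches the "capture and refine" role the abstract assigns to the attention stage.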
Journal Introduction:
Sensors (ISSN 1424-8220) provides an advanced forum for the science and technology of sensors and biosensors. It publishes reviews (including comprehensive reviews of complete sensor products), regular research papers, and short notes. Our aim is to encourage scientists to publish their experimental and theoretical results in as much detail as possible. There is no restriction on the length of papers; full experimental details must be provided so that the results can be reproduced.