A Transpose-SELDNet for Polyphonic Sound Event Localization and Detection

2023 IEEE 8th International Conference for Convergence in Technology (I2CT). Pub Date: 2023-04-07. DOI: 10.1109/I2CT57861.2023.10126251

S. V, S. Koolagudi


Abstract

Humans can identify an event occurring in their surroundings from sound cues alone, even when no visual scene is available. Sound events are the auditory cues present in an environment. Sound event detection (SED) is the process of determining the onset and offset of each sound event and assigning it a textual label. Sound source localization (SSL) refers to identifying the spatial location of a sound event, in addition to SED. The joint task of SED and SSL is known as Sound Event Localization and Detection (SELD). This work explores three deep learning architectures for SELD: SELDNet, D-SELDNet (Depthwise Convolution), and T-SELDNet (Transpose Convolution). Two sets of features are used for the SED and Direction-of-Arrival (DOA) estimation tasks. D-SELDNet uses a depthwise convolution layer, which reduces the model's complexity in terms of computation time. T-SELDNet uses transpose convolution, which helps it learn more discriminative features by retaining the input size and so not losing necessary information from the input. The proposed method is evaluated on the First-order Ambisonic (FOA) array format of the TAU-NIGENS Spatial Sound Events 2020 dataset. The proposed T-SELDNet shows an improvement over existing SELD systems.
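The two claims above (depthwise convolution lowers cost, transpose convolution can retain the input size) can be illustrated with the standard textbook formulas. The sketch below is not the authors' implementation: the abstract gives no layer configurations, so the channel counts, kernel size, stride, and padding are hypothetical, and pairing the depthwise layer with a 1 x 1 pointwise convolution is an assumption.

```python
# Illustrative arithmetic only; all layer sizes below are hypothetical.


def conv2d_params(c_in: int, c_out: int, k: int) -> int:
    """Weight count of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weight count of a depthwise k x k conv plus a 1 x 1 pointwise conv.

    Assumption: the depthwise layer is followed by a pointwise layer,
    which the abstract does not state.
    """
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


def conv_out_size(n: int, k: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of a convolution along one axis."""
    return (n + 2 * padding - k) // stride + 1


def conv_transpose_out_size(n: int, k: int, stride: int = 1,
                            padding: int = 0, output_padding: int = 0) -> int:
    """Output length of a transpose convolution along one axis."""
    return (n - 1) * stride - 2 * padding + k + output_padding


# Hypothetical layer: 64 -> 64 channels, 3 x 3 kernel.
standard = conv2d_params(64, 64, 3)
separable = depthwise_separable_params(64, 64, 3)
print("standard weights: ", standard)
print("separable weights:", separable)

# Hypothetical axis of 64 bins: a stride-2 conv halves it, and a matching
# transpose conv (with output_padding=1) restores the original length.
down = conv_out_size(64, k=3, stride=2, padding=1)
up = conv_transpose_out_size(down, k=3, stride=2, padding=1, output_padding=1)
print("after strided conv:  ", down)
print("after transpose conv:", up)
```

With these hypothetical sizes the standard layer has 36,864 weights against 4,672 for the depthwise-plus-pointwise pair, and the transpose convolution brings a length of 32 back to 64. A stride-1 transpose convolution with a 3 x 3 kernel and padding 1 leaves the length unchanged, which is one way a network can avoid shrinking its input.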


