GLFER-Net:基于全局-局部特征提取和重新校准的复调声源定位和检测网络

IF 1.7 3区 计算机科学 Q2 ACOUSTICS Eurasip Journal on Audio Speech and Music Processing Pub Date : 2024-06-26 DOI:10.1186/s13636-024-00356-4
Mengzhen Ma, Ying Hu, Liang He, Hao Huang
{"title":"GLFER-Net:基于全局-局部特征提取和重新校准的复调声源定位和检测网络","authors":"Mengzhen Ma, Ying Hu, Liang He, Hao Huang","doi":"10.1186/s13636-024-00356-4","DOIUrl":null,"url":null,"abstract":"Polyphonic sound source localization and detection (SSLD) task aims to recognize the categories of sound events, identify their onset and offset times, and detect their corresponding direction-of-arrival (DOA), where polyphonic refers to the occurrence of multiple overlapping sound sources in a segment. However, vanilla SSLD methods based on convolutional recurrent neural network (CRNN) suffer from insufficient feature extraction. The convolutions with kernel of single scale in CRNN fail to adequately extract multi-scale features of sound events, which have diverse time-frequency characteristics. It results in that the extracted features lack fine-grained information helpful for the localization of sound sources. In response to these challenges, we propose a polyphonic SSLD network based on global-local feature extraction and recalibration (GLFER-Net), where the global-local feature (GLF) extractor is designed to extract the multi-scale global features through an omni-directional dynamic convolution (ODConv) layer and multi-scale feature extraction (MSFE) module. The local feature extraction (LFE) unit is designed for capturing detailed information. Besides, we design a feature recalibration (FR) module to emphasize the crucial features along multiple dimensions. On the open datasets of Task3 in DCASE 2021 and 2022 Challenges, we compared our proposed GLFER-Net with six and four SSLD methods, respectively. The results show that the GLFER-Net achieves competitive performance. The modules we designed are verified to be effective through a series of ablation experiments and visualization analyses.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"94 1","pages":""},"PeriodicalIF":1.7000,"publicationDate":"2024-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"GLFER-Net: a polyphonic sound source localization and detection network based on global-local feature extraction and recalibration\",\"authors\":\"Mengzhen Ma, Ying Hu, Liang He, Hao Huang\",\"doi\":\"10.1186/s13636-024-00356-4\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Polyphonic sound source localization and detection (SSLD) task aims to recognize the categories of sound events, identify their onset and offset times, and detect their corresponding direction-of-arrival (DOA), where polyphonic refers to the occurrence of multiple overlapping sound sources in a segment. However, vanilla SSLD methods based on convolutional recurrent neural network (CRNN) suffer from insufficient feature extraction. The convolutions with kernel of single scale in CRNN fail to adequately extract multi-scale features of sound events, which have diverse time-frequency characteristics. It results in that the extracted features lack fine-grained information helpful for the localization of sound sources. In response to these challenges, we propose a polyphonic SSLD network based on global-local feature extraction and recalibration (GLFER-Net), where the global-local feature (GLF) extractor is designed to extract the multi-scale global features through an omni-directional dynamic convolution (ODConv) layer and multi-scale feature extraction (MSFE) module. The local feature extraction (LFE) unit is designed for capturing detailed information. Besides, we design a feature recalibration (FR) module to emphasize the crucial features along multiple dimensions. On the open datasets of Task3 in DCASE 2021 and 2022 Challenges, we compared our proposed GLFER-Net with six and four SSLD methods, respectively. The results show that the GLFER-Net achieves competitive performance. The modules we designed are verified to be effective through a series of ablation experiments and visualization analyses.\",\"PeriodicalId\":49202,\"journal\":{\"name\":\"Eurasip Journal on Audio Speech and Music Processing\",\"volume\":\"94 1\",\"pages\":\"\"},\"PeriodicalIF\":1.7000,\"publicationDate\":\"2024-06-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Eurasip Journal on Audio Speech and Music Processing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1186/s13636-024-00356-4\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ACOUSTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Eurasip Journal on Audio Speech and Music Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1186/s13636-024-00356-4","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
引用次数: 0

摘要

复调声源定位和检测(SSLD)任务旨在识别声音事件的类别、确定其起始和偏移时间,并检测其相应的到达方向(DOA),其中复调指的是在一个片段中出现多个重叠声源。然而,基于卷积递归神经网络(CRNN)的传统 SSLD 方法存在特征提取不足的问题。CRNN 中的卷积核为单尺度,无法充分提取具有不同时频特征的声音事件的多尺度特征。这导致提取的特征缺乏有助于声源定位的细粒度信息。为了应对这些挑战,我们提出了一种基于全局本地特征提取和重新校准的多声道 SSLD 网络(GLFER-Net),其中全局本地特征提取器(GLF)通过全向动态卷积(ODConv)层和多尺度特征提取(MSFE)模块提取多尺度全局特征。局部特征提取(LFE)单元用于捕捉细节信息。此外,我们还设计了一个特征重新校准(FR)模块,以强调多个维度上的关键特征。在 DCASE 2021 年和 2022 年挑战赛任务 3 的开放数据集上,我们将所提出的 GLFER-Net 分别与六种和四种 SSLD 方法进行了比较。结果表明,GLFER-Net 的性能极具竞争力。通过一系列消融实验和可视化分析,我们验证了所设计模块的有效性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
GLFER-Net: a polyphonic sound source localization and detection network based on global-local feature extraction and recalibration
Polyphonic sound source localization and detection (SSLD) task aims to recognize the categories of sound events, identify their onset and offset times, and detect their corresponding direction-of-arrival (DOA), where polyphonic refers to the occurrence of multiple overlapping sound sources in a segment. However, vanilla SSLD methods based on convolutional recurrent neural network (CRNN) suffer from insufficient feature extraction. The convolutions with kernel of single scale in CRNN fail to adequately extract multi-scale features of sound events, which have diverse time-frequency characteristics. It results in that the extracted features lack fine-grained information helpful for the localization of sound sources. In response to these challenges, we propose a polyphonic SSLD network based on global-local feature extraction and recalibration (GLFER-Net), where the global-local feature (GLF) extractor is designed to extract the multi-scale global features through an omni-directional dynamic convolution (ODConv) layer and multi-scale feature extraction (MSFE) module. The local feature extraction (LFE) unit is designed for capturing detailed information. Besides, we design a feature recalibration (FR) module to emphasize the crucial features along multiple dimensions. On the open datasets of Task3 in DCASE 2021 and 2022 Challenges, we compared our proposed GLFER-Net with six and four SSLD methods, respectively. The results show that the GLFER-Net achieves competitive performance. The modules we designed are verified to be effective through a series of ablation experiments and visualization analyses.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Eurasip Journal on Audio Speech and Music Processing
Eurasip Journal on Audio Speech and Music Processing ACOUSTICS-ENGINEERING, ELECTRICAL & ELECTRONIC
CiteScore
4.10
自引率
4.20%
发文量
0
审稿时长
12 months
期刊介绍: The aim of “EURASIP Journal on Audio, Speech, and Music Processing” is to bring together researchers, scientists and engineers working on the theory and applications of the processing of various audio signals, with a specific focus on speech and music. EURASIP Journal on Audio, Speech, and Music Processing will be an interdisciplinary journal for the dissemination of all basic and applied aspects of speech communication and audio processes.
期刊最新文献
Compression of room impulse responses for compact storage and fast low-latency convolution Guest editorial: AI for computational audition—sound and music processing Physics-constrained adaptive kernel interpolation for region-to-region acoustic transfer function: a Bayesian approach Physics-informed neural network for volumetric sound field reconstruction of speech signals Optimal sensor placement for the spatial reconstruction of sound fields
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1