SoundDoA: Learn Sound Source Direction of Arrival and Semantics from Sound Raw Waveforms

Yuhang He, A. Markham
Interspeech 2022, pp. 2408-2412. Published 2022-09-18.
DOI: 10.21437/interspeech.2022-378 (https://doi.org/10.21437/interspeech.2022-378)
Citations: 3

Abstract

A fundamental task for an agent that must understand its environment acoustically is to detect each sound source's location (for example, its direction of arrival, DoA) and its semantic label. The task is challenging for three reasons: first, sound sources overlap in time, frequency and space; second, while semantics are largely conveyed through time-frequency energy (amplitude) contours, DoA is encoded in the inter-channel phase difference; third, although the microphone sensors are few, the recorded waveform is temporally dense because of the high sampling rate. Existing methods for predicting DoA mostly depend on pre-extracted 2D acoustic features such as GCC-PHAT and Mel-spectrograms, so as to benefit from mature deep neural networks designed for 2D images. We instead propose a novel end-to-end trainable framework, named SoundDoA, that learns sound source DoA and semantics directly from raw sound waveforms. We first use a learnable front-end filter bank to dynamically encode features relevant to sound source semantics and DoA into a compact representation. A backbone network consisting of two identical sub-networks with a layerwise communication strategy then learns the semantic label and DoA both separately and jointly. Finally, a permutation-invariant multi-track head regresses the DoA and classifies the semantic label. Extensive experiments on the DCASE 2020 sound event localization and detection (SELD) dataset show that SoundDoA outperforms the existing methods it is compared with.
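To make the statement that DoA is encoded in the inter-channel phase difference more concrete, the following is a minimal NumPy sketch of GCC-PHAT, the classical pre-extracted feature that the abstract contrasts SoundDoA with. It is not SoundDoA's learnable front end. The function names, the two-microphone far-field geometry, the microphone spacing and the speed of sound are all assumptions introduced for illustration.

```python
import numpy as np


def gcc_phat_delay(x, y, fs):
    """Estimate the delay of y relative to x, in seconds, with GCC-PHAT.

    Hypothetical helper for illustration; x and y are 1-D arrays holding
    the two microphone channels, fs is the sampling rate in Hz.
    """
    n = len(x) + len(y)                  # zero-pad to avoid circular wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = Y * np.conj(X)               # cross-power spectrum
    cross /= np.abs(cross) + 1e-12       # PHAT weighting: discard magnitude, keep phase
    cc = np.fft.irfft(cross, n=n)        # generalized cross-correlation over lags
    max_shift = n // 2
    # Reorder so that index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(cc)) - max_shift
    return shift / fs


def delay_to_angle(tau, mic_distance=0.1, speed_of_sound=343.0):
    """Map a delay to a far-field angle in degrees for two microphones.

    Assumed geometry: sin(theta) = c * tau / d, with theta measured from
    the broadside direction. The default spacing is an arbitrary example.
    """
    s = np.clip(speed_of_sound * tau / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fs = 16000
    x = rng.standard_normal(1024)                 # synthetic source at microphone 1
    y = np.concatenate((np.zeros(3), x[:-3]))     # microphone 2 hears it 3 samples later
    tau = gcc_phat_delay(x, y, fs)
    print(tau * fs, delay_to_angle(tau))
```

The PHAT weighting removes the amplitude of the cross-spectrum, so the delay estimate relies on phase alone, whereas semantic cues live mostly in the amplitude contours. This is one way to see why a single representation serving both tasks needs to preserve both kinds of information.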